Abstract

Value iteration is a popular algorithm for finding near optimal policies for POMDPs. 
It is inefficient due to the need to account for the entire belief space, which necessitates 
the solution of large numbers of linear programs. In this paper, we study value iteration 
restricted to belief subsets. We show that, together with properly chosen belief subsets, 
restricted value iteration yields near-optimal policies and we give a condition for determining
whether a given belief subset would bring about savings in space and time. We also
apply restricted value iteration to two interesting classes of POMDPs, namely informative 
POMDPs and near-discernible POMDPs. 

1. Introduction 

Partially Observable Markov Decision Processes (POMDPs) provide a general framework 
for sequential decision-making tasks where the effects of an agent's actions are nondeterministic
and the states of the world or environment are not known with certainty. Due to the
model generality, POMDPs have found a variety of potential applications in reality (Monahan,
1982; Cassandra, 1998b). However, solving POMDPs is computationally intractable.
Extensive efforts have been devoted to developing efficient algorithms for finding solutions
to POMDPs (Parr & Russell, 1995; Cassandra, Littman, & Zhang, 1997; Cassandra, 1998a;
Hansen, 1998; Zhang, 2001). 

Value iteration is a popular algorithm for solving POMDPs. Two central concepts in 
value iteration are belief state and value function. A belief state, a probability distribution 
over the state space, measures the probability that the environment is in each state. All 
possible belief states constitute a belief space. A value function specifies a payoff or cost 
for each belief state in the belief space. Value iteration proceeds in an iterative fashion. 
Each iteration, referred to as a dynamic programming (DP) update, computes a new value 
function from the current one. When the algorithm terminates, the final value function is 
used for the agent's action selection. Value iteration is computationally expensive because, 
at each iteration, it updates the current value function over the entire belief space, which 
necessitates the solution of a large number of linear programs. 

One generic strategy to accelerate value iteration is to restrict value iteration, that is, DP 
updates, to a subset of the belief space. For simplicity, a subset of the belief space is referred 
to as a belief subset. Existing value iteration algorithms working with belief subsets include a
family of grid-based algorithms where DP updates calculate values for a finite grid (Lovejoy,
1991; Hauskrecht, 1997; Zhou & Hansen, 2001), and several (maybe anytime) algorithms
where DP updates calculate values for a growing belief subset (Dean, Kaelbling, Kirman,
& Nicholson, 1993; Washington, 1997; Hansen & Zilberstein, 1998; Hansen, 1998; Bonet &
Geffner, 2000). By restricting value iteration to a belief subset, the complexity of value
functions is reduced and DP updates become more efficient. These advantages have been
observed by several researchers (Hauskrecht & Fraser, 1998; Roy & Gordon, 2002; Zhang 
& Zhang, 2001b; Pineau, Gordon, & Thrun, 2003). 

A fundamental issue in restricted value iteration is how to select a belief subset. The 
efficiency of value iteration and the quality of its generated value functions strongly depend 
on the selected belief subset. In one extreme case, if the subset is chosen to be a singleton 
set, value iteration is efficient, but the quality of value functions can be arbitrarily poor. 
In the other extreme case, if the subset is the belief space, the quality of value functions 
is retained, but the algorithm is inefficient. There exists a tradeoff between the size of the 
belief subset and the quality of value functions. 

In this paper, we show that it is indeed possible for value iteration to not only work with 
a belief subset but also retain the quality of value functions. This is achieved by deliberately 
selecting a belief subset for value iteration. Sometimes, we refer to the algorithm working 
with our selected belief subset as subset value iteration. (For distinction, restricted value 
iteration refers to value iteration working with any belief subset.) The efficiency of subset 
value iteration depends on the size of the selected subset. We characterize a condition that
determines a priori whether the subset is proper¹ with respect to the belief space for a given
POMDP. If this is the case, subset value iteration carries the space and time advantages.

We also study two special POMDP classes, namely informative POMDPs and near- 
discernible POMDPs. An informative POMDP assumes that an agent has a good albeit 
imperfect idea about world states at any time point. For an informative POMDP, there 
exists a natural belief subset so that value iteration restricted to it can be more efficient 
than standard value iteration (Zhang & Liu, 1997). A near-discernible POMDP assumes 
that an agent has a good idea about world states once in a while. For a near-discernible
POMDP, we propose a restricted value iteration algorithm that starts with a small belief
subset and grows it gradually. The algorithm terminates once a proper tradeoff between the
size of the subset and policy quality is found. Because of near-discernibility, the algorithm is
able to find a good tradeoff before the subset grows too large.

The algorithms developed in this paper have been tested in a variety of small maze problems
designed to possess various properties as desired, and a number of problems adapted
from existing research or created from our office environment. Our results show that by
exploiting problem characteristics, restricted value iteration can solve larger POMDPs
than standard value iteration. We show how the algorithmic performance varies with the
properties of the selected belief subset for the maze problems. These small problems facilitate
exposition of the properties of the chosen belief subsets. Meanwhile, the experiments
provide clues on which POMDP classes are amenable to the respective algorithms.

The rest of the paper is organized as follows. In the next section, we introduce the
POMDP model and value iteration. In the two subsequent sections, we present our subset
value iteration algorithm and analyze its theoretical properties. In particular, in Section
3, we show how to select the belief subset and how the selected subset is related to the
belief space. In Section 4, we describe the subset value iteration algorithm and discuss why
it is able to achieve near optimality. In Section 5, we examine informative POMDPs and
show how the algorithm that exploits the informativeness is related to the general subset
value iteration (Zhang & Liu, 1997). In Section 6, we examine near-discernible POMDPs
and develop an anytime algorithm. We empirically demonstrate that the algorithm is able
to compute value functions of high quality. In Section 7, we survey work related to this
research.

1. Set A is a proper subset of set B if (1) A is a subset of B, and (2) there exists at least one element in B
such that it does not belong to A.

2. POMDPs and Value Iteration 

This section gives a brief overview of the POMDP model and value iteration. 

2.1 POMDPs 

A POMDP is a sequential decision model for an agent who acts in a stochastic environment 
with only partial knowledge about the state of its environment. The set of possible states 
of the environment is referred to as the state space and is denoted by S. At each point in
time, the environment is in one of the possible states. The agent does not directly observe 
the state. Rather, it receives an observation about it. We denote the set of all possible 
observations as Z. After receiving the observation, the agent chooses an action from a set 
A of possible actions and executes that action. Thereafter, the agent receives an immediate 
reward and the environment evolves stochastically into a next state. 

Mathematically, a POMDP is specified by: three sets S, Z, and A; a reward function
r(s, a) for s in S and a in A; a transition probability function P(s'|s, a); and an observation
probability function P(z|s', a) for z in Z and s' in S. The reward function characterizes the
dependency of the immediate reward on the current state s and the current action a. The 
transition probability characterizes the dependency of the next state s' on the current state 
s and the current action a. The observation probability characterizes the dependency of 
the observation z at the next time point on the next state s' and the current action a. 

2.2 Policies and Value Functions 

Since the current observation does not necessarily fully reveal the identity of the current 
state, the agent needs to consider all previous observations and actions when choosing an 
action. Information about the current state contained in the current observation, previous 
observations, and previous actions can be summarized by a probability distribution over 
the state space (Astrom, 1965). The probability distribution is sometimes called a belief 
state and denoted by b. For any possible state s, b(s) is the probability that the current
state is s. The set of all possible belief states is called the belief space. We denote it by B. 
A policy prescribes an action for each possible belief state. In other words, it is a 
mapping from B to A. Associated with a policy $\pi$ is its value function $V^\pi$. For each belief
state b, $V^\pi(b)$ is the expected total discounted reward that the agent receives by following
the policy starting from b, i.e., $V^\pi(b) = E_{\pi,b}[\sum_{t=0}^{\infty} \lambda^t r_t]$, where $r_t$ is the reward received
at time t and $\lambda$ ($0 < \lambda < 1$) is the discount factor. It is known that there exists a policy $\pi^*$
such that $V^{\pi^*}(b) \ge V^\pi(b)$ for any other policy $\pi$ and any belief state b (Puterman, 1994).
Such a policy is called an optimal policy. The value function of an optimal policy is called
the optimal value function. We denote it by $V^*$. For any positive number $\epsilon$, a policy $\pi$ is
$\epsilon$-optimal if $V^\pi(b) + \epsilon \ge V^*(b)$ for any b in B.

2.3 Value Iteration 

To explain value iteration, we need to consider how the belief state evolves over time. Let b 
be the current belief state. The belief state at the next point in time is determined by the 
current belief state, the current action a, and the next observation z. We denote it by $\tau(b, a, z)$.
For any state s', $\tau(b, a, z)(s')$ is given by

$$\tau(b, a, z)(s') = \frac{\sum_{s} P(z, s'|s, a)\, b(s)}{P(z|b, a)} \qquad (1)$$

where $P(z, s'|s, a) = P(z|s', a) P(s'|s, a)$ and $P(z|b, a) = \sum_{s, s'} P(z, s'|s, a)\, b(s)$ is the renormalization
constant. As the notation suggests, the constant can also be interpreted as the
probability of observing z after taking action a in belief state b. 
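
As an illustration, the belief update of Equation (1) can be sketched in a few lines of Python. The function name `belief_update` and the nested-list layout of the transition and observation models are assumptions made for this sketch only; they are not part of the original formulation.

```python
def belief_update(b, a, z, P_trans, P_obs):
    """Return (tau(b, a, z), P(z | b, a)), or (None, 0.0) if z cannot occur.

    Assumed layout (hypothetical, chosen for this sketch):
      P_trans[a][s][s2] = P(s2 | s, a)
      P_obs[a][s2][z]   = P(z | s2, a)
    """
    n = len(b)
    # Unnormalized next belief: sum_s P(z, s' | s, a) b(s) for each s'.
    unnorm = [
        P_obs[a][s2][z] * sum(P_trans[a][s][s2] * b[s] for s in range(n))
        for s2 in range(n)
    ]
    p_z = sum(unnorm)  # renormalization constant P(z | b, a)
    if p_z == 0.0:
        return None, 0.0  # the update is undefined when z has probability 0
    return [u / p_z for u in unnorm], p_z
```

Returning the renormalization constant together with the next belief is a convenience: the same quantity reappears as the weight of each observation in the DP update.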

With the concept of belief state, a POMDP model can be transformed into a belief space 
MDP as follows. 

• The state space is B and the action space is A. 

• Given a belief state b and an action a, the transition model specifies the transition 
probability as follows. 

$$P(b'|b, a) = \begin{cases} P(z|b, a) & \text{if } b' = \tau(b, a, z) \text{ for some } z, \\ 0 & \text{otherwise.} \end{cases}$$

• Given a belief state b and action a, the reward model specifies immediate reward
$r(b, a)$ as $r(b, a) = \sum_{s \in S} b(s)\, r(s, a)$.

Due to this reformulation, the task of solving a POMDP can be accomplished by solving 
the reformulated MDP. It has been proven that the reformulated MDP has a stationary 
optimal policy, which can be found by stochastic dynamic programming (Bellman, 1957; 
Puterman, 1994). 

Value iteration is a dynamic programming algorithm for finding $\epsilon$-optimal policies for
an MDP. It starts with an initial value function $V_0$ and iterates using the following formula:

$$V_{n+1}(b) = \max_{a} \Big[ r(b, a) + \lambda \sum_{z} P(z|b, a)\, V_n(\tau(b, a, z)) \Big] \quad \forall b \in B \qquad (2)$$

where $V_n$ is referred to as the nth-step value function. It is known that $V_n$ geometrically
converges to $V^*$ as n goes to infinity.
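
To make the DP update concrete, the following Python sketch evaluates the right-hand side of the value iteration formula at a single belief state. It is a minimal illustration, not an exact algorithm over the whole belief space; the function name `bellman_backup`, the callable `V`, and the nested-list model layout are assumptions introduced here.

```python
def bellman_backup(b, V, R, P_trans, P_obs, discount):
    """Apply the value iteration formula once at belief state b.

    V is any callable mapping a belief (list of floats) to a number.
    Assumed layout (hypothetical): R[a][s] = r(s, a),
    P_trans[a][s][s2] = P(s2 | s, a), P_obs[a][s2][z] = P(z | s2, a).
    """
    n = len(b)
    best = float("-inf")
    for a in range(len(R)):
        value = sum(b[s] * R[a][s] for s in range(n))  # r(b, a)
        for z in range(len(P_obs[a][0])):
            unnorm = [
                P_obs[a][s2][z] * sum(P_trans[a][s][s2] * b[s] for s in range(n))
                for s2 in range(n)
            ]
            p_z = sum(unnorm)  # P(z | b, a)
            if p_z > 0.0:
                next_b = [u / p_z for u in unnorm]  # tau(b, a, z)
                value += discount * p_z * V(next_b)
        best = max(best, value)
    return best
```

Observations with zero probability are skipped, since the next belief state is undefined for them and they contribute nothing to the sum.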

For a given value function V, a policy $\pi$ is said to be V-improving if

$$\pi(b) = \arg\max_{a} \Big[ r(b, a) + \lambda \sum_{z} P(z|b, a)\, V(\tau(b, a, z)) \Big] \quad \forall b \in B. \qquad (3)$$

The following theorem tells one when to terminate value iteration given a precision requirement
$\epsilon$ (Puterman, 1994). The stopping criterion depends on the quantity $\max_{b \in B} |V_n(b) - V_{n-1}(b)|$,
which is the maximum difference between $V_n$ and $V_{n-1}$ over the belief space. The
quantity is often called the Bellman residual between $V_n$ and $V_{n-1}$ (Puterman, 1994).

Theorem 1 If $\max_{b} |V_n(b) - V_{n-1}(b)| < \epsilon(1 - \lambda)/(2\lambda)$, then the $V_{n-1}$-improving policy is
$\epsilon$-optimal.
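
For a sense of scale, the stopping threshold in Theorem 1 can be computed directly. The helper below is a hypothetical convenience function written for this illustration; only the formula itself comes from the theorem.

```python
def bellman_residual_threshold(epsilon, discount):
    """Return epsilon * (1 - discount) / (2 * discount), the bound in Theorem 1."""
    if not 0.0 < discount < 1.0:
        raise ValueError("discount must lie strictly between 0 and 1")
    return epsilon * (1.0 - discount) / (2.0 * discount)


def should_stop(residual, epsilon, discount):
    """True once the Bellman residual falls below the threshold."""
    return residual < bellman_residual_threshold(epsilon, discount)
```

The threshold shrinks as the discount factor approaches 1, so a fixed precision requirement demands a much smaller Bellman residual for weakly discounted problems.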



Since there are infinitely many belief states, value functions cannot be explicitly represented.
Fortunately, value functions that one encounters in the process of value iteration
admit implicit finite representations (Sondik, 1971). 

2.4 Technical and Notational Considerations 

For convenience, we view functions over the state space as vectors of size |S|. We use lower
case Greek letters $\alpha$ and $\beta$ to refer to vectors and script letters $\mathcal{V}$ and $\mathcal{U}$ to refer to sets of
vectors. In contrast, the upper case letters V and U always refer to value functions, that is,
functions over the belief space B.

A set $\mathcal{V}$ of vectors induces a piecewise linear convex value function (say f) as follows:
$f(b) = \max_{\alpha \in \mathcal{V}} \alpha \cdot b$ for any b in B, where $\alpha \cdot b$ is the inner product of $\alpha$ and b. For convenience,
we shall abuse notation and use $\mathcal{V}$ to denote both a set of vectors and the value function
induced by the set. Under this convention, the quantity f(b) can be written as $\mathcal{V}(b)$.
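
The value function induced by a set of vectors is simple to evaluate at any given belief state, as the following Python sketch shows. The name `induced_value` and the list-of-lists representation are assumptions for this illustration.

```python
def induced_value(vectors, b):
    """Value at belief b of the function induced by a set of vectors.

    Each vector and the belief b are sequences of length |S|; the result is
    the maximum inner product between b and any vector in the set.
    """
    return max(
        sum(alpha_s * b_s for alpha_s, b_s in zip(alpha, b))
        for alpha in vectors
    )
```

Because the function is a pointwise maximum of linear functions of b, it is piecewise linear and convex, which is what makes the finite representation possible.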

A vector in a set is useless if its removal does not affect the function that the set induces. 
It is useful otherwise. A set of vectors is minimal if it contains no useless vectors. Let $\alpha$
be a vector in set $\mathcal{V}$. It is known that $\alpha$ is useful if and only if there is at least one belief
state b such that $\alpha \cdot b > \alpha' \cdot b$ for all $\alpha' \in \mathcal{V} \setminus \{\alpha\}$. Such a belief state is called a witness point of
$\alpha$ because it testifies to the fact that $\alpha$ is useful (Kaelbling, Littman, & Cassandra, 1998).
To determine the usefulness of a vector in a set, it is sufficient to solve one linear program.
To compute a minimal set for a given set $\mathcal{V}$ of vectors, it is sufficient to solve $|\mathcal{V}|$ linear
programs. The procedure of computing a minimal set for a given set of vectors is often
referred to as pruning a set. 

2.5 Finite Representation of Value Functions and Value Iteration 

A value function V is represented by a set of vectors if it equals the value function induced 
by the set. When a value function is representable by a finite set of vectors, there is a 
unique minimal set that represents the function (Littman, Cassandra, & Kaelbling, 1995). 

Sondik (1971) has shown that if a value function is representable by a finite set of 
vectors, then so are the subsequent value functions derived by DP updates. The process 
of obtaining the minimal representation for $V_{n+1}$ from the minimal representation of $V_n$ is
usually referred to as a dynamic programming (DP) update.

In practice, value iteration for POMDPs is not carried out directly in terms of value
functions themselves. Rather, it is carried out in terms of sets of vectors that represent the
value functions. One begins with an initial set of vectors $\mathcal{V}_0$ (often set to a zero vector).
At each iteration, one performs a DP update on the previous minimal set $\mathcal{V}_n$ of vectors
and obtains a new minimal set $\mathcal{V}_{n+1}$ of vectors. One continues until the Bellman residual
$\max_{b} |\mathcal{V}_{n+1}(b) - \mathcal{V}_n(b)|$, which is determined by solving a sequence of linear programs, falls
below a threshold. 
3. Belief Subset Selection

In this section, we show how to select a belief subset for value iteration. We describe
a condition determining whether the selected subset is proper w.r.t. the belief space. In 
addition, we discuss the minimal representation of value functions w.r.t. the selected subset. 
In the next section, we develop the subset value iteration algorithm and show why it is able 
to achieve near optimality. 

3.1 Subset Selection 

Our belief subset selection rests on belief updating. Let the agent's current belief be b. Its 
next belief state is $\tau(b, a, z)$ if it performs action a and receives observation z. If we vary
the belief state b in the belief space B, we obtain a set $\{\tau(b, a, z) \mid b \in B\}$. Abusing our
notation, we denote this set by $\tau(B, a, z)$. In words, no matter which belief state the agent
starts with, if it receives z after performing a, its next belief state must be in $\tau(B, a, z)$.

The union $\bigcup_{a,z} \tau(B, a, z)$ takes into account the sets of belief states for all possible
combinations of actions and observations. It contains all the belief states that the agent can
encounter. In other words, the agent's belief state at any time point must belong to this
set regardless of its initial belief state, performed actions and received observations. We
denote the set by $\tau(B, A, Z)$ or simply $\tau(B)$. It is a closed set in the sense that no action
can lead the agent to belief states outside $\tau(B)$ if the agent starts with a belief state in it.
Furthermore, any belief subset between the set $\tau(B)$ and the belief space B is closed.

Lemma 1 The set $\tau(B)$ is closed. Moreover, if $\tau(B) \subseteq B' \subseteq B$, then B' is closed.

As is apparent, the set $\tau(B)$ is a subset of the belief space B. Its definition is an application
of reachability analysis (Boutilier, Brafman, & Geib, 1998; Dean et al., 1993). Under the
terminology in reachability analysis, the subset $\tau(B, a, z)$ comprises the one-step reachable
belief states if the agent performs action a and receives observation z, while the subset $\tau(B)$
comprises the one-step reachable belief states regardless of performed actions and received
observations. Although the belief subset $\tau(B)$ is the set of one-step reachable belief states,
an appealing property, to be shown in the next subsection, is that value iteration working
with it can preserve the quality of the generated value functions.

3.2 Subset Representation 

Subset representation addresses how to represent the subsets $\tau(B, a, z)$ and $\tau(B)$. For this,
we introduce the concept of belief simplex.

Definition 1 Let $B = \{b_1, b_2, \ldots, b_k\}$ be a set of belief states. A belief simplex $\triangle$ generated
by B is the set of belief states $\{\sum_{i=1}^{k} \lambda_i b_i \mid \lambda_i \ge 0 \text{ and } \sum_{i=1}^{k} \lambda_i = 1.0\}$.

The set B is said to be a basis of the belief simplex $\triangle$. From the definition, the belief
simplex (or simply simplex) is the set of convex combinations of the belief states in the
basis. Following the standard terms in linear algebra, we can also talk about the minimum
basis of a simplex. For convenience, we use notation $B_\triangle$ to denote a basis of a given simplex
$\triangle$. Additionally, the simplex with the basis $\{b_1, b_2, \ldots, b_k\}$ is denoted by $\triangle(b_1, b_2, \ldots, b_k)$.

Our result is that for any a and z, the subset $\tau(B, a, z)$ is a simplex. The intuition
follows. Let the number of states in a POMDP be n. For each $i \in \{1, 2, \ldots, n\}$, $b_i$ is a unit
vector, i.e., $b_i(s)$ equals 1.0 for s = i and 0.0 otherwise. For belief state $b_i$, if $P(z|b_i, a) > 0$,
$\tau(b_i, a, z)$ is a belief state in $\tau(B, a, z)$; if $P(z|b_i, a) = 0$, by the belief update equation
$\tau(b_i, a, z)$ is undefined. In the belief space B, it is trivial to note that any belief state can
be represented as a convex combination of belief states in $\{b_1, b_2, \ldots, b_n\}$. Correspondingly,
in the belief subset $\tau(B, a, z)$, any belief state can be represented as a convex combination
of belief states in $\{\tau(b_i, a, z) \mid P(z|b_i, a) > 0\}$. Hence, $\{\tau(b_i, a, z) \mid P(z|b_i, a) > 0\}$ is a basis of
$\tau(B, a, z)$. For convenience, we denote such a basis by $B_{\tau(B,a,z)}$.

Theorem 2 For any pair [a, z], the subset $\tau(B, a, z)$ is a simplex.

Proof: See Appendix A. □ 
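
The construction behind Theorem 2 suggests a direct way to compute a basis: update each unit belief state and keep the results that are defined. The Python sketch below follows that idea; the function name `simplex_basis` and the nested-list model layout are assumptions for this illustration, and the returned basis is not guaranteed to be a minimum basis.

```python
def simplex_basis(a, z, P_trans, P_obs):
    """Return the belief states obtained by updating every unit belief for (a, z).

    Assumed layout (hypothetical): P_trans[a][s][s2] = P(s2 | s, a),
    P_obs[a][s2][z] = P(z | s2, a). Unit beliefs from which z cannot be
    observed are skipped because their update is undefined.
    """
    n = len(P_trans[a])
    basis = []
    for i in range(n):
        # For the unit belief concentrated on state i: P(z, s' | s_i, a) for each s'.
        unnorm = [P_trans[a][i][s2] * P_obs[a][s2][z] for s2 in range(n)]
        p_z = sum(unnorm)  # P(z | b_i, a)
        if p_z > 0.0:
            basis.append([u / p_z for u in unnorm])
    return basis
```

At most |S| belief states are returned per action-observation pair, so the whole subset can be described by at most |A| |Z| |S| basis points.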

By the above theorem, the subset $\tau(B)$ is a union of simplices. Although the subset
$\tau(B)$ is not linearly representable on its own, it is the union of linearly representable sets.
Later in the section, this property is crucial to and will be exploited in finding the minimal
representing sets of value functions w.r.t. the belief subset $\tau(B)$.

To concretize the ideas on subset representation, we give a POMDP example and visualize
the simplices for actions and observations. Before presenting the example, we mention
that we shall use it for additional purposes later on in this paper. First, we shall use it
to show the difference between two conditions determining whether the subset $\tau(B)$ is a
proper subset of the belief space. Second, we shall use it to demonstrate the fundamental
differences between two restricted value iteration algorithms.

Example The POMDP has three states $\{s_1, s_2, s_3\}$, two actions $\{a_1, a_2\}$ and two observations
$\{z_1, z_2\}$. We define the transition and observation model for action $a_1$. These models
for $a_2$ can be defined similarly. To shorten notations, we use $p_{ij}$ to denote the transition
probability $P(s_j|s_i, a_1)$ and $q_{ij}$ to denote the observation probability $P(z_j|s_i, a_1)$. We assume
that (1) for any state $s_i$, the probability $p_{i1}$ is equal to $p_{i2}$, i.e., $p_{i1} = p_{i2}$; (2) at each
state $s_i$, observations $z_1$ and $z_2$ are received with the same probability, i.e., $q_{i1} = q_{i2} = 0.5$
for each i; and (3) $p_{11} > p_{21} > p_{31}$. Under these assumptions, the matrix $P_{a_1 z_1}$, whose entry
in row j and column i is the joint probability $P(s_j, z_1|s_i, a_1) = q_{j1} p_{ij}$, is

$$P_{a_1 z_1} = \begin{pmatrix} 0.5 p_{11} & 0.5 p_{21} & 0.5 p_{31} \\ 0.5 p_{11} & 0.5 p_{21} & 0.5 p_{31} \\ 0.5(1 - 2p_{11}) & 0.5(1 - 2p_{21}) & 0.5(1 - 2p_{31}) \end{pmatrix}. \qquad (4)$$

Because of the first assumption, the first two rows of the matrix are the same. In the third
row, the probability $p_{i3}$ is replaced with $1 - p_{i1} - p_{i2}$, i.e., $1 - 2p_{i1}$.

We compute the basis of the belief subset $\tau(B, a_1, z_1)$. Let the basis of the belief space
B be the set $\{(1.0, 0, 0)^T, (0, 1.0, 0)^T, (0, 0, 1.0)^T\}$. (For a matrix or vector A, $A^T$ denotes
its transpose.) For action $a_1$ and observation $z_1$, the next belief states, denoted by $A_i$s, are:

$$\begin{aligned}
A_1 &= \tau((1.0, 0, 0)^T, a_1, z_1) = (p_{11}, p_{11}, 1.0 - 2p_{11})^T \\
A_2 &= \tau((0, 1.0, 0)^T, a_1, z_1) = (p_{21}, p_{21}, 1.0 - 2p_{21})^T \\
A_3 &= \tau((0, 0, 1.0)^T, a_1, z_1) = (p_{31}, p_{31}, 1.0 - 2p_{31})^T
\end{aligned} \qquad (5)$$

Interestingly, it can be shown that $A_2$ is a convex combination of $A_1$ and $A_3$. In fact, it
can be verified that

$$\tau((0, 1.0, 0)^T, a_1, z_1) = \lambda_1 \tau((1.0, 0, 0)^T, a_1, z_1) + \lambda_2 \tau((0, 0, 1.0)^T, a_1, z_1)$$

where $\lambda_1 = \frac{p_{21} - p_{31}}{p_{11} - p_{31}}$ and $\lambda_2 = \frac{p_{11} - p_{21}}{p_{11} - p_{31}}$. Thus $\lambda_1 + \lambda_2 = 1.0$. Our third assumption ensures
that $\lambda_1$ and $\lambda_2$ are greater than 0.0. Because $A_2$ is a convex combination of $A_1$ and $A_3$, the
three belief states $A_1$ to $A_3$ lie on the same straight line.

Figure 1 visualizes the belief space B on the left and the simplex $\tau(B, a_1, z_1)$ on the
right. The right chart is based on these parameters: $p_{11} = 0.5$, $p_{21} = 0.4$ and $p_{31} = 0.1$.
The belief states $A_i$s are the following: $A_1 = (0.5, 0.5, 0.0)^T$, $A_2 = (0.4, 0.4, 0.2)^T$ and
$A_3 = (0.1, 0.1, 0.8)^T$. We see that the belief space is a triangular area and the belief simplex is
a line segment in that area. In the next subsection, we shall return to this point and show
why.
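
The collinearity of the three belief states can be checked numerically with the parameters used for Figure 1. The short Python sketch below recomputes the belief states and the coefficients of the convex combination; the variable names are chosen for this illustration only.

```python
# Transition parameters used for the right chart of Figure 1.
p11, p21, p31 = 0.5, 0.4, 0.1

# Next belief states reached from the three unit beliefs under (a1, z1).
A1 = (p11, p11, 1.0 - 2.0 * p11)
A2 = (p21, p21, 1.0 - 2.0 * p21)
A3 = (p31, p31, 1.0 - 2.0 * p31)

# Coefficients expressing A2 as a convex combination of A1 and A3.
lam1 = (p21 - p31) / (p11 - p31)
lam2 = (p11 - p21) / (p11 - p31)

# Recombine A1 and A3; up to rounding this reproduces A2.
combo = tuple(lam1 * x + lam2 * y for x, y in zip(A1, A3))
print(lam1, lam2, combo)
```

With these parameters the coefficients come out as roughly 0.75 and 0.25, both positive and summing to one, as the third assumption requires.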



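[Figure 1 placeholder: two panels drawn on axes b(s1), b(s2) and b(s3), with origin (0,0,0) and
vertices (1.0,0,0), (0,1.0,0) and (0,0,1.0). Left panel: the belief space. Right panel: the belief
simplex, on which the point $A_2$ is marked.]

Figure 1: A graphical representation of belief space and belief simplex. See text for explanations.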



To show the belief subset $\tau(B)$, we continue to define the transition and the observation
models for action $a_2$.² We may follow the three assumptions for $a_1$ to define these models
for $a_2$, but give a different set of transition probabilities so that $a_2$ differs from $a_1$.

By the second assumption that observations $z_1$ and $z_2$ are received with the same probability,
the matrix $P_{a_i z_1}$ is identical to $P_{a_i z_2}$ for $i \in \{1, 2\}$. So, the simplex $\tau(B, a_i, z_1)$ is
identical to $\tau(B, a_i, z_2)$ for each i. As such, the subset $\tau(B)$ consists of two line segments in
the entire belief space. □



3.3 Belief Subset and Belief Space 

We discuss the relationship between the set $\tau(B)$ and the belief space B. Since the set $\tau(B)$
is a union of simplices, it helps to show how each simplex $\tau(B, a, z)$ is related to the belief
space B. For an action a and observation z, it turns out that a matrix derived from the
transition and observation models plays a central role in determining the simplex. Such
a matrix, denoted by $P_{az}$, is of dimension $|S| \times |S|$ and its entry in row s' and column s is the
joint probability $P(s', z|s, a)$, i.e.,

$$P_{az} = \begin{pmatrix}
P(s'_1, z|s_1, a) & P(s'_1, z|s_2, a) & \cdots & P(s'_1, z|s_n, a) \\
P(s'_2, z|s_1, a) & P(s'_2, z|s_2, a) & \cdots & P(s'_2, z|s_n, a) \\
\vdots & \vdots & \ddots & \vdots \\
P(s'_n, z|s_1, a) & P(s'_n, z|s_2, a) & \cdots & P(s'_n, z|s_n, a)
\end{pmatrix}.$$



2. Other components of the POMDP do not affect our discussions here and are omitted for convenience. 
The matrix can be used to relate the next belief $\tau(b, a, z)$ and the current b. If b and
$\tau(b, a, z)$ are viewed in column vector form, the belief update Equation (1) can be rewritten
as $\tau(b, a, z) = \frac{1}{P(z|b, a)} P_{az} b$. Hence, $\tau(b, a, z)$ is a transformation of b and the matrix $P_{az}$ is
called the transformational matrix. The following lemma characterizes a condition under
which the simplex $\tau(B, a, z)$ is the same as the belief space B.

Lemma 2 For any [a, z], there exists a bijection between the simplex $\tau(B, a, z)$ and the
space B if the matrix $P_{az}$ is invertible³.

This can be seen from the fact that $\tau(b, a, z) = \frac{1}{P(z|b, a)} P_{az} b$ and $\tau(B, a, z)$ is the set of the
transformed belief states from the belief space. Consequently, the simplex $\tau(B, a, z)$ is a
proper subset of B if the matrix $P_{az}$ is degenerate. We note that the matrix $P_{az}$ is degenerate
if there exists a state s' such that $P(z|s', a) = 0.0$, i.e., there is a state such that the agent
can never receive the observation z (if action a is performed). This is because all the entries
in the row corresponding to s' are 0.0 in the matrix. In this case, by the lemma, the set
$\tau(B, a, z)$ must be a proper subset of the belief space B.

Corollary 1 If there is a state s such that $P(z|s, a) = 0.0$, the set $\tau(B, a, z)$ is a proper
subset of the belief space B.

However, that $\tau(B, a, z)$ is a proper subset does not necessarily imply that there exists
a state s such that $P(z|s, a) = 0.0$. Consequently, to determine if a belief simplex is
proper, the above corollary provides a sufficient condition; in contrast, Lemma 2 provides
a sufficient and necessary condition. This will be illustrated by continuing our discussions
on the POMDP example.

Example (Continued) We consider the simplex $\tau(B, a_1, z_1)$. By the second assumption,
$z_1$ can be observed at any state. So, there does not exist a state s such that $P(z_1|s, a_1) = 0.0$.
On the other hand, the matrix $P_{a_1 z_1}$ is degenerate because it has two identical rows. By
Lemma 2, the simplex $\tau(B, a_1, z_1)$ is a proper subset of the belief space. In fact, as seen in
Figure 1, it is a line segment, which can be viewed as a degenerate belief space. □
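
The degeneracy in the example can also be confirmed numerically. The sketch below builds the transformational matrix for the first action and first observation with the parameters used for Figure 1 and evaluates its determinant; the helper `det3` is a hypothetical utility written for this illustration.

```python
def det3(m):
    """Determinant of a 3x3 matrix given as three rows (cofactor expansion)."""
    (a, b, c), (d, e, f), (g, h, i) = m
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)


# p11, p21, p31 as used for Figure 1; every observation probability is 0.5.
p = (0.5, 0.4, 0.1)
top = [0.5 * x for x in p]
bottom = [0.5 * (1.0 - 2.0 * x) for x in p]

# Rows index the next state, columns the current state; rows 1 and 2 coincide.
P_a1z1 = [top, list(top), bottom]
determinant = det3(P_a1z1)
print(determinant)
```

The determinant vanishes (up to rounding) even though no entry of the observation model is zero, which is why the corollary alone cannot detect this case.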

We proceed to discover the relationship between the belief set $\tau(B, A, Z)$ and the belief
space B. Since $\tau(B, A, Z)$ is the union of the simplices $\tau(B, a, z)$, it is a proper subset of the
belief space if so is each simplex. In turn, this requires that each transformational matrix
is degenerate.

Theorem 3 The subset $\tau(B)$ is a proper subset of belief space B only if the transformational
matrices for all actions and observations are degenerate.

3.4 Subset Value Functions 

We discuss value functions whose domains are belief subset $\tau(B, a, z)$ or $\tau(B)$. For simplicity,
we refer to them as subset value functions. The problem we will examine is, given a set of 
vectors representing a subset value function, how to compute a minimal set w.r.t. a belief 
subset. We first consider the case where the subset is a simplex. 



3. A matrix is invertible if its determinant is non-zero. It is degenerate otherwise. 
In order to calculate a minimal set of vectors, one needs to determine the usefulness of
a vector in a set w.r.t. the simplex. Let $\beta$ be a vector in set $\mathcal{V}$ and the simplex be $\tau(B, a, z)$.
The vector $\beta$ is useful w.r.t. $\tau(B, a, z)$ if and only if there is a belief state b in the simplex
such that $\beta \cdot b \ge \alpha \cdot b + x$ where x is a sufficiently small positive number and $\alpha$ is any vector
in the set $\mathcal{V} - \{\beta\}$. Moreover, if such a belief state b exists, since it is in the simplex, b must
be representable by the belief states in the basis $B_{\tau(B,a,z)}$, i.e., $b = \sum_i \lambda_i \tau(b_i, a, z)$. If we
replace b by $\sum_i \lambda_i \tau(b_i, a, z)$ in $\beta \cdot b \ge \alpha \cdot b + x$, the condition of determining $\beta$'s usefulness
is equivalent to this: whether there exists a series of nonnegative numbers $\lambda_i$s such that for
any vector $\alpha$ in $\mathcal{V} - \{\beta\}$,

$$\beta \cdot \sum_i \lambda_i \tau(b_i, a, z) \ge \alpha \cdot \sum_i \lambda_i \tau(b_i, a, z) + x.$$

Rewriting the above inequality, we have

$$\sum_i [\beta \cdot \tau(b_i, a, z)]\, \lambda_i \ge \sum_i [\alpha \cdot \tau(b_i, a, z)]\, \lambda_i + x. \qquad (6)$$

To determine $\beta$'s usefulness, the procedure simplexLP in Table 1 is used. When the
optimality of the linear program is reached, one checks its objective x. If it is positive, there
exists a belief state in belief simplex $\tau(B, a, z)$ such that at this belief state $\beta$ dominates
other vectors. The belief state is a witness point of $\beta$. It is represented as $\sum_i \lambda_i \tau(b_i, a, z)$
where $\lambda_i$s are the solutions (values of variables) of the linear program. In this case, the
belief state $\sum_i \lambda_i \tau(b_i, a, z)$ is returned. For other cases, no belief state is returned and $\beta$ is
a useless vector.

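simplexLP($\beta$, $\mathcal{V}$, $B_{\tau(B,a,z)}$):

1. Variables: x, $\lambda_i$ for each i

2. Maximize: x.

3. Constraints:

4.    $\sum_i [\beta \cdot \tau(b_i, a, z)]\, \lambda_i \ge \sum_i [\alpha \cdot \tau(b_i, a, z)]\, \lambda_i + x$ for all $\alpha \in \mathcal{V} - \{\beta\}$,
      where $\tau(b_i, a, z) \in B_{\tau(B,a,z)}$ for each i

5.    $\sum_i \lambda_i = 1$, $\lambda_i \ge 0$ for each i.

simplexPrune($\mathcal{V}$, $B_{\tau(B,a,z)}$):

1. $\mathcal{U} \leftarrow \mathcal{V}$, $\mathcal{V} \leftarrow \emptyset$

2. For each $\beta$ in $\mathcal{U}$

3.    $b \leftarrow$ simplexLP($\beta$, $\mathcal{U}$, $B_{\tau(B,a,z)}$)

4.    If $b \ne$ null

5.       $\mathcal{V} \leftarrow \mathcal{V} \cup \{\beta\}$

6. Return $\mathcal{V}$

Table 1: The procedure to compute the minimal set of vectors over a simplex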

To determine a vector's usefulness in a given set, one linear program needs to be solved. 
If a vector is useless, its removal does not change the value function that the set induces. 
Therefore, to compute a minimal set for a given set $\mathcal{V}$ w.r.t. a simplex, one needs to solve
$|\mathcal{V}|$ linear programs.

This procedure is implemented in simplexPrune of Table 1. Its input has two arguments:
a set of vectors $\mathcal{V}$⁴ and a basis of the simplex $B_{\tau(B,a,z)}$. A set $\mathcal{U}$ is initialized to be the
set $\mathcal{V}$ and the set $\mathcal{V}$ to be empty at line 1. Useful vectors are added to the set $\mathcal{V}$ in the
sequel. For each vector in set $\mathcal{U}$, at line 3 the procedure simplexLP is called to determine its
usefulness. If it returns a belief state, the vector is added to the set $\mathcal{V}$ at line 5. Eventually,
the set $\mathcal{V}$ becomes the minimal representation of $\mathcal{U}$ w.r.t. simplex $\tau(B, a, z)$.

To compute a minimal set of vectors w.r.t. the subset $\tau(B)$, one needs to determine a
vector's usefulness w.r.t. the subset. In turn, one needs to determine its usefulness w.r.t.
each simplex. Again, let the set be $\mathcal{V}$ and the vector be $\beta$. If $\beta$ is useful w.r.t. a simplex, it
must be useful w.r.t. the subset. However, if it is useless w.r.t. a simplex, it may be useful
w.r.t. another simplex. Hence, for a simplex, if $\beta$ has been identified as useful, there is no
need to check it again for subsequent simplices. After all the simplices have been examined,
if $\beta$ is useless w.r.t. all simplices, it is useless w.r.t. the subset. By removing all useless
vectors w.r.t. the subset, one obtains the minimal set.
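
The control flow of pruning over a union of simplices can be sketched as follows. This is only an outline: the linear program itself is abstracted into a caller-supplied function `find_witness`, a hypothetical stand-in for simplexLP, and the names `prune_over_union` and `simplex_bases` are likewise assumptions of this sketch.

```python
def prune_over_union(vectors, simplex_bases, find_witness):
    """Keep the vectors that are useful w.r.t. at least one simplex.

    find_witness(beta, vectors, basis) is supplied by the caller and plays the
    role of simplexLP: it returns a witness belief state, or None when beta is
    useless over the simplex spanned by `basis`.
    """
    kept = []
    for beta in vectors:
        for basis in simplex_bases:
            if find_witness(beta, vectors, basis) is not None:
                kept.append(beta)
                break  # useful for one simplex, so later simplices are skipped
    return kept
```

In the worst case every vector is tested against every simplex, so the number of linear programs is bounded by the number of vectors times the number of simplices; the early exit reduces this whenever a witness is found quickly.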

4. Subset Value Iteration 

In this section, we first describe the value iteration algorithm in belief subset $\tau(B)$. We
then show that the algorithm is able to achieve near optimality. Finally, we analyze its 
complexity and report empirical studies. 

4.1 Belief Subset MDP 

Because the subset $\tau(B)$ is closed, we are able to define a so-called belief subset MDP (or
simply subset MDP). Its state space is the chosen subset $\tau(B)$ and other components are
the same as those in the MDP transformed from the original POMDP (Section 2.3). The 
only difference between the two MDPs lies in their state spaces: the state space of the belief 
subset MDP is a subset of the state space of the belief space MDP. 

4.2 Subset DP Updates 

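By MDP theory, the subset MDP admits the following DP update equation, where 
$V_n^{\tau(\mathcal{B})}$ represents its $n$th-step value function. 

$$V_{n+1}^{\tau(\mathcal{B})}(b) = \max_a \Big\{ r(b,a) + \lambda \sum_z P(z|b,a)\, V_n^{\tau(\mathcal{B})}(\tau(b,a,z)) \Big\} \quad \forall b \in \tau(\mathcal{B}). \qquad (7)$$

Following the equation, an implicit DP update computes the minimal set 
$\mathcal{V}_{n+1}^{\tau(\mathcal{B})}$ representing value function $V_{n+1}^{\tau(\mathcal{B})}$ 
from $\mathcal{V}_n^{\tau(\mathcal{B})}$ representing $V_n^{\tau(\mathcal{B})}$. Note that the 
domains of these value functions are the belief subset $\tau(\mathcal{B})$. For simplicity, such a 
step is called a subset DP update. 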
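
To make the update concrete, the following Python sketch evaluates the right-hand side of the DP 
update at a single belief state. It is an illustration under stated assumptions rather than the 
paper's implementation: the names belief_update and backup, the nested-list layout of the model 
(T[a][s][s2], O[a][s2][z], R[a][s]) and the two-state toy model are all hypothetical. 

```python
def belief_update(b, a, z, T, O):
    """Return (tau(b, a, z), P(z | b, a)).

    T[a][s][s2] = P(s2 | s, a) and O[a][s2][z] = P(z | s2, a).
    """
    n = len(b)
    unnorm = [O[a][s2][z] * sum(b[s] * T[a][s][s2] for s in range(n)) for s2 in range(n)]
    pz = sum(unnorm)
    if pz == 0.0:
        return None, 0.0  # observation z cannot occur after action a
    return [u / pz for u in unnorm], pz


def backup(b, vectors, R, T, O, lam):
    """Right-hand side of the DP update at belief b; R[a][s] = r(s, a).

    vectors only needs to represent the current value function on the
    next beliefs tau(b, a, z), i.e. on the belief subset.
    """
    n_obs = len(O[0][0])
    best = float("-inf")
    for a in range(len(T)):
        value = sum(b[s] * R[a][s] for s in range(len(b)))
        for z in range(n_obs):
            nxt, pz = belief_update(b, a, z, T, O)
            if pz > 0.0:
                value += lam * pz * max(sum(x * y for x, y in zip(v, nxt)) for v in vectors)
        best = max(best, value)
    return best


# Toy model: one action, identity transitions, perfect observations.
T = [[[1.0, 0.0], [0.0, 1.0]]]
O = [[[1.0, 0.0], [0.0, 1.0]]]
R = [[1.0, 0.0]]
V_n = [[1.0, 0.0], [0.0, 1.0]]
value = backup([1.0, 0.0], V_n, R, T, O, lam=0.9)
```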

Implicit subset DP updates can be carried out as standard DP updates. Here, we present 
a two-pass algorithm due to its conceptual simplicity (Monahan, 1982). It constructs the 
next set of vectors in two steps: an enumeration step enumerating all possible vectors and 
a reduction step removing useless vectors. In the following, we focus on the enumeration 
step. For the reduction step, since the usefulness of a vector is w.r.t. the subset 
$\tau(\mathcal{B})$, the techniques in the preceding section can be used. 

4. For simplicity, we assume that the set V does not contain duplicate vectors. Duplicates can be 
removed by a simple componentwise check. 

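Given a set $\mathcal{V}_n^{\tau(\mathcal{B})}$, a vector in the representing set of 
$V_{n+1}^{\tau(\mathcal{B})}$ can be defined by a pair consisting of an action and a mapping from 
the set $\mathcal{Z}$ of observations to the set $\mathcal{V}_n^{\tau(\mathcal{B})}$. To be precise, 
for an action $a$ and a mapping $\delta$, a vector, denoted by $\beta_{a,\delta}$, is defined as 
follows (see footnote 5). For each $s \in \mathcal{S}$, 

$$\beta_{a,\delta}(s) = r(s,a) + \lambda \sum_z \sum_{s'} P(s'|s,a)\, P(z|s',a)\, \delta_z(s') \qquad (8)$$

where $\delta_z$ is the mapped vector for observation $z$. 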

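If we enumerate all possible combinations of actions and mappings above, we can define 
various vectors. These vectors form the set 

$$\{\beta_{a,\delta} \mid a \in \mathcal{A},\ \delta: \mathcal{Z} \to \mathcal{V}_n^{\tau(\mathcal{B})}\ \&\ \forall z,\ \delta_z \in \mathcal{V}_n^{\tau(\mathcal{B})}\}. \qquad (9)$$

The set is denoted by $\mathcal{V}_{n+1}^{\tau(\mathcal{B})}$. By MDP theory, it represents the 
value function $V_{n+1}^{\tau(\mathcal{B})}$ if the set $\mathcal{V}_n^{\tau(\mathcal{B})}$ 
represents $V_n^{\tau(\mathcal{B})}$. 

Lemma 3 The set $\mathcal{V}_{n+1}^{\tau(\mathcal{B})}$ represents value function 
$V_{n+1}^{\tau(\mathcal{B})}$ if $\mathcal{V}_n^{\tau(\mathcal{B})}$ represents 
$V_n^{\tau(\mathcal{B})}$. 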
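
The enumeration step can be sketched in Python as below. The sketch is illustrative only: the 
function name enumerate_vectors, the nested-list model layout (T[a][s][s2], O[a][s2][z], R[a][s]) 
and the toy model are assumptions, and no pruning is performed. It builds one vector per pair of 
an action and a mapping from observations to current vectors, so it returns 
$|\mathcal{A}|\,|\mathcal{V}_n|^{|\mathcal{Z}|}$ vectors. 

```python
from itertools import product


def enumerate_vectors(vectors, R, T, O, lam):
    """Build beta_{a,delta} for every action a and every mapping delta.

    A mapping delta assigns one current vector to each observation, so it is
    represented as a tuple (delta_0, ..., delta_{|Z|-1}) of vectors.
    """
    n_states = len(T[0])
    n_obs = len(O[0][0])
    result = []
    for a in range(len(T)):
        for delta in product(vectors, repeat=n_obs):
            beta = [
                R[a][s] + lam * sum(
                    T[a][s][s2] * O[a][s2][z] * delta[z][s2]
                    for z in range(n_obs)
                    for s2 in range(n_states)
                )
                for s in range(n_states)
            ]
            result.append(beta)
    return result


# Toy model: one action, identity transitions, perfect observations.
T = [[[1.0, 0.0], [0.0, 1.0]]]
O = [[[1.0, 0.0], [0.0, 1.0]]]
R = [[1.0, 0.0]]
V_n = [[1.0, 0.0], [0.0, 1.0]]
new_vectors = enumerate_vectors(V_n, R, T, O, lam=0.9)
```

The exponential growth in the number of observations is what the reduction step, and algorithms 
such as incremental pruning, are meant to contain. 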



The above DP update works in a collective fashion in that it directly computes value 
functions over $\tau(\mathcal{B})$. An alternative way to conduct DP updates is to compute value 
functions for individual simplices one by one. The rationale is that, by letting a DP update 
work with the finer-grained belief subsets, it could be more efficient than its collective 
version. A DP update in an individual fashion constructs a collection 
$\{\mathcal{V}_{n+1}^{\tau(\mathcal{B},a,z)}\}$ of vector sets from a given collection 
$\{\mathcal{V}_n^{\tau(\mathcal{B},a,z)} \mid a \in \mathcal{A}, z \in \mathcal{Z}\}$, where each 
$\mathcal{V}_n^{\tau(\mathcal{B},a,z)}$ represents $V_n^{\tau(\mathcal{B})}$ in the simplex 
$\tau(\mathcal{B},a,z)$. We consider how to construct a set 
$\mathcal{V}_{n+1}^{\tau(\mathcal{B},a',z')}$ for one simplex $\tau(\mathcal{B},a',z')$. 

Likewise, a vector $\beta_{a,\delta}$ in $\mathcal{V}_{n+1}^{\tau(\mathcal{B},a',z')}$ can be 
defined by an action $a$ and a mapping $\delta$. The fact that $\tau(b,a,z)$ must be in 
$\tau(\mathcal{B},a,z)$ for any $b$ implies that, for any $z$, $\delta_z$ can be restricted to a 
vector in the set $\mathcal{V}_n^{\tau(\mathcal{B},a,z)}$. By altering actions and mappings, one 
obtains the following set: 

$$\{\beta_{a,\delta} \mid a \in \mathcal{A},\ \delta: \mathcal{Z} \to \cup_{a,z} \mathcal{V}_n^{\tau(\mathcal{B},a,z)}\ \&\ \forall z,\ \delta_z \in \mathcal{V}_n^{\tau(\mathcal{B},a,z)}\}. \qquad (10)$$

It differs from (9) in that, for an observation $z$, the vector mapped by $\delta$ is restricted 
to the set $\mathcal{V}_n^{\tau(\mathcal{B},a,z)}$. The above set is denoted by 
$\mathcal{V}_{n+1}^{\tau(\mathcal{B},a',z')}$. To obtain its minimal representation, one removes 
useless vectors from the set w.r.t. the simplex $\tau(\mathcal{B},a',z')$. The value function 
that the set $\mathcal{V}_{n+1}^{\tau(\mathcal{B},a',z')}$ induces is equal to the value function 
$V_{n+1}^{\tau(\mathcal{B})}$ in the belief simplex $\tau(\mathcal{B},a',z')$. 



5. The procedure of defining a vector actually constructs an (n+1)th-step policy tree. (See, e.g., 
Zhang and Liu, 1997, for details.) 

Lemma 4 For any action $a$ and observation $z$, each set $\mathcal{V}_{n+1}^{\tau(\mathcal{B},a,z)}$ 
represents the same value function $V_{n+1}^{\tau(\mathcal{B})}$ in the simplex 
$\tau(\mathcal{B},a,z)$ if each $\mathcal{V}_n^{\tau(\mathcal{B},a,z)}$ represents 
$V_n^{\tau(\mathcal{B})}$ in the same simplex. 
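
The enumeration behind the individual fashion can be sketched as follows. This is a hypothetical 
illustration, not the authors' code: the dictionary V_az, keyed by (action, observation), the 
function name and the toy model are assumptions. The only change from the collective enumeration 
is that, for action $a$, the vector assigned to observation $z$ is drawn from the set kept for 
the simplex of the pair $(a, z)$. The enumerated set itself does not depend on the target 
simplex; only the subsequent pruning does. 

```python
from itertools import product


def enumerate_for_simplices(V_az, R, T, O, lam):
    """Enumerate beta_{a,delta} with delta_z restricted to V_az[(a, z)].

    V_az[(a, z)] is the list of vectors kept for the simplex of (a, z);
    T[a][s][s2] = P(s2 | s, a), O[a][s2][z] = P(z | s2, a), R[a][s] = r(s, a).
    """
    n_states = len(T[0])
    n_obs = len(O[0][0])
    result = []
    for a in range(len(T)):
        per_obs_sets = [V_az[(a, z)] for z in range(n_obs)]
        for delta in product(*per_obs_sets):
            beta = [
                R[a][s] + lam * sum(
                    T[a][s][s2] * O[a][s2][z] * delta[z][s2]
                    for z in range(n_obs)
                    for s2 in range(n_states)
                )
                for s in range(n_states)
            ]
            result.append(beta)
    return result


# Toy model: one action, identity transitions, perfect observations.
T = [[[1.0, 0.0], [0.0, 1.0]]]
O = [[[1.0, 0.0], [0.0, 1.0]]]
R = [[1.0, 0.0]]
V_az = {(0, 0): [[1.0, 0.0]], (0, 1): [[0.0, 1.0], [0.0, 0.5]]}
new_vectors = enumerate_for_simplices(V_az, R, T, O, lam=0.9)
```

Here only 1 x 2 = 2 vectors are enumerated, whereas drawing every $\delta_z$ from the union of 
the three stored vectors would enumerate 3 x 3 = 9. 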

Although subset DP updates can be carried out either collectively or individually, they 
are essentially equivalent in terms of value functions induced. 

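Theorem 4 Let $\mathcal{U} = \cup_{a,z} \mathcal{V}_{n+1}^{\tau(\mathcal{B},a,z)}$. For any $b \in \tau(\mathcal{B})$, $\mathcal{U}(b) = V_{n+1}^{\tau(\mathcal{B})}(b)$. 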

It is worthwhile to note that for two pairs of actions and observations, the simplices 
$\tau(\mathcal{B},a_1,z_1)$ and $\tau(\mathcal{B},a_2,z_2)$ might not be disjoint. A few remarks 
are in order for this case. First, by Theorem 4, for any $b$ in the intersection of the simplices, 
$\mathcal{V}_{n+1}^{\tau(\mathcal{B},a_1,z_1)}(b) = \mathcal{V}_{n+1}^{\tau(\mathcal{B},a_2,z_2)}(b)$. 
This is because both sets represent $V_{n+1}^{\tau(\mathcal{B})}$ on their own simplices, and 
hence on the intersection. Second, if a subset DP update is carried out individually, it may 
generate more vectors than its collective version. This is because the two sets 
$\mathcal{V}_{n+1}^{\tau(\mathcal{B},a_1,z_1)}$ and $\mathcal{V}_{n+1}^{\tau(\mathcal{B},a_2,z_2)}$ 
may contain duplicate vectors. 

Finally, we note that to achieve computational savings, sophisticated algorithms for 
standard DP updates can be applied to subset DP updates. Let us take incremental pruning, 
one of the most efficient algorithms, as an example (Cassandra et al., 1997; Zhang & Liu, 
1997). In standard incremental pruning, all the pruning operations are w.r.t. the belief 
space; however, when it is used in subset DP updates, all the pruning operations are w.r.t. 
the belief subsets. 

4.3 Analysis 

We analyze several theoretical properties of the subset value iteration algorithm. Our main 
results are the following: value functions generated by subset value iteration are equivalent to 
those generated by standard value iteration in some sense; to achieve near optimality, value 
iteration needs to account for at least the belief subset $\tau(\mathcal{B})$; and the value 
function generated by subset value iteration can be used for near optimal decision-making in the 
entire belief space if the algorithm is appropriately terminated. 

4.3.1 Belief Subset, Value Functions and Value Iterations 

Subset value iteration generates a series $\{V_n^{\tau(\mathcal{B})}\}$ of value functions. If its 
initial value function $V_0^{\tau(\mathcal{B})}$ is the same as the initial $V_0$ of standard value 
iteration in $\tau(\mathcal{B})$, subset value iteration generates the same series of value 
functions as standard value iteration in $\tau(\mathcal{B})$. 

Theorem 5 If $V_0^{\tau(\mathcal{B})}(b) = V_0(b)$ for any $b$ in $\tau(\mathcal{B})$, then 
$V_n^{\tau(\mathcal{B})}(b) = V_n(b)$ for any $n$ and any $b$ in $\tau(\mathcal{B})$. 

Proof: We first consider one DP update computing value function $V_{n+1}$ from the current 
$V_n$ by DP Equation (2). In its right hand side, since $\tau(b,a,z)$ must belong to the subset 
$\tau(\mathcal{B})$, the notation $V_n(\tau(\cdot,\cdot,\cdot))$ can be interpreted as a value 
function over the subset $\tau(\mathcal{B})$ rather than the belief space $\mathcal{B}$. Comparing 
DP Equation (2) for the belief space $\mathcal{B}$ and Equation (7) for the belief subset 
$\tau(\mathcal{B})$, we see that $V_{n+1}$ and $V_{n+1}^{\tau(\mathcal{B})}$ represent the same 
value function in $\tau(\mathcal{B})$ if $V_n$ and $V_n^{\tau(\mathcal{B})}$ do. 
The theorem is true for $n = 0$ by the given condition. It is true for $n > 0$ by induction. 
□ 

More interestingly, the value function $V_n^{\tau(\mathcal{B})}$ at step $n$ can be used to derive 
the value function $V_{n+1}$ in standard value iteration. To see why, we first note that a subset 
value function $V_n^{\tau(\mathcal{B})}$ can be used to define a value function (say $V$) by a 
one-step lookahead operation as follows: 

$$V(b) = \max_a \Big\{ r(b,a) + \lambda \sum_z P(z|b,a)\, V_n^{\tau(\mathcal{B})}(\tau(b,a,z)) \Big\} \quad \forall b \in \mathcal{B}. \qquad (11)$$

The so-defined $V$ is called the $V_n^{\tau(\mathcal{B})}$-improving value function. Second, 
comparing Equations (11) and (2), we see that the $V_n^{\tau(\mathcal{B})}$-improving value 
function is actually $V_{n+1}$ if $V_n^{\tau(\mathcal{B})}$ is equal to $V_n$ in 
$\tau(\mathcal{B})$. 

Consequently, although subset value iteration works with $\tau(\mathcal{B})$, the value functions 
generated in standard value iteration can be derived from it. In this sense, we say 
$\tau(\mathcal{B})$ is a sufficient belief subset, since it enables subset value iteration to 
preserve standard value functions without "loss". 

Since subset value iteration retains the quality of value functions, it can be regarded as 
an exact algorithm. One interesting question is: if value iteration intends to retain quality, 
can it work with a proper subset of $\tau(\mathcal{B})$? In general, the answer is no. The reason 
is as follows. To compute $V_{n+1}$, one needs to keep the values of $V_n$ for belief states in 
$\tau(\mathcal{B})$. If one instead accounts for only a proper subset $\mathcal{B}'$ of 
$\tau(\mathcal{B})$, it can be proven that there exist a belief state $b$ in $\mathcal{B}$, an 
action $a$ and an observation $z$ such that $\tau(b,a,z)$ does not belong to $\mathcal{B}'$. 
The value update of $V_{n+1}(b)$ depends on the values for all possible next belief states. 
Because $V_n(\tau(b,a,z))$ is unavailable, the value $V_{n+1}(b)$ cannot be calculated exactly. 
Consequently, if value iteration works with a proper subset of $\tau(\mathcal{B})$, it cannot be 
exact; in other words, it becomes an approximate algorithm. To be exact, value iteration needs to 
consider at least $\tau(\mathcal{B})$. In this sense, the subset $\tau(\mathcal{B})$ is said to be 
a minimal sufficient set. 

Informally, we use Figure 2 to illustrate the relationship between belief subsets and value 
iteration. In the figure, circles represent belief sets. The minimum belief subset for value 
iteration to retain quality is $\tau(\mathcal{B})$, while the maximum subset is the belief space 
$\mathcal{B}$ itself. If value iteration works with a belief subset $\mathcal{B}'$ (denoted by 
dashed circles) between $\tau(\mathcal{B})$ and $\mathcal{B}$, its quality is also retained. 
However, if it works with a proper belief subset of $\tau(\mathcal{B})$, in general it is unable 
to retain the quality of value functions. 



Figure 2: The relationship between belief subsets and value iteration 

4.3.2 Stopping Criterion and Decision Making 

Subset value iteration starts with an initial value function. As it continues, the Bellman 
residual $\max_{b \in \tau(\mathcal{B})} |V_n^{\tau(\mathcal{B})}(b) - V_{n-1}^{\tau(\mathcal{B})}(b)|$, 
the maximum difference between two consecutive value functions over the subset 
$\tau(\mathcal{B})$, must become smaller. By MDP theory, if the quantity falls below 
$\epsilon(1-\lambda)/(2\lambda)$, the value function $V_{n-1}^{\tau(\mathcal{B})}$ is 
$\epsilon$-optimal w.r.t. the subset MDP. In the following, we show that the 
$\epsilon$-optimality can be extended to the entire belief space by appropriately 
terminating the subset value iteration algorithm. 

Let the output value function be $V_{n-1}^{\tau(\mathcal{B})}$. It can be used to define a policy 
for any belief state in $\mathcal{B}$ as in Equation (3), where $V$ is replaced by 
$V_{n-1}^{\tau(\mathcal{B})}$. The policy $\pi$ is said to be 
$V_{n-1}^{\tau(\mathcal{B})}$-improving. Note that the policy prescribes an action for any belief 
state in the belief space. The following theorem tells one when to terminate subset value 
iteration such that the $V_{n-1}^{\tau(\mathcal{B})}$-improving policy is $\epsilon$-optimal over 
the belief space. 

Theorem 6 If $\max_{b \in \tau(\mathcal{B})} |V_n^{\tau(\mathcal{B})}(b) - V_{n-1}^{\tau(\mathcal{B})}(b)| \le \epsilon(1-\lambda)/(2\lambda^2|\mathcal{Z}|)$, 
then the $V_{n-1}^{\tau(\mathcal{B})}$-improving policy is $\epsilon$-optimal over the entire 
belief space $\mathcal{B}$. 

Proof: See Appendix A. □ 
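
For a sense of scale, the two termination thresholds can be compared numerically. The Python 
sketch below is only an illustration of the two formulas; the function names are hypothetical, 
and the values $\epsilon = 0.01$, $\lambda = 0.95$ and $|\mathcal{Z}| = 6$ are taken from the 
experimental settings and the maze1 problem described in Section 4.5. 

```python
def loose_threshold(eps, lam):
    """Bellman-residual threshold eps * (1 - lam) / (2 * lam)."""
    return eps * (1.0 - lam) / (2.0 * lam)


def strict_threshold(eps, lam, n_obs):
    """Threshold of Theorem 6: eps * (1 - lam) / (2 * lam**2 * |Z|)."""
    return eps * (1.0 - lam) / (2.0 * lam ** 2 * n_obs)


loose = loose_threshold(0.01, 0.95)
strict = strict_threshold(0.01, 0.95, n_obs=6)
ratio = loose / strict  # equals lam * |Z|
```

The strict threshold is smaller than the loose one by a factor of $\lambda|\mathcal{Z}|$, which 
is why subset value iteration may need more iterations before it terminates. 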

This theorem is important for two reasons. First, although subset value iteration outputs 
a subset value function, $\epsilon$-optimal value functions over the entire space can be induced 
by a one-step lookahead operation. Second, it implies that the 
$V_{n-1}^{\tau(\mathcal{B})}$-improving policy is $\epsilon$-optimal if the condition is met. We 
know that $\tau(\mathcal{B})$ consists of all possible belief states the agent encounters after 
the initial belief state. However, we make no assumption about the initial belief state. It may 
or may not belong to this set. The theorem means that the agent is still able to select a near 
optimal action for an initial belief state even if it is not in the subset. In fact, the agent 
can always select a near optimal action for any belief state in the entire belief space. 

Finally, we note that to guarantee $\epsilon$-optimality, subset value iteration uses a more 
restrictive condition than the one in Theorem 1. For convenience, it is sometimes called a strict 
stopping criterion. In contrast, the condition in standard value iteration is called a loose 
stopping criterion. 

4.4 Complexity 

To be put to use for a POMDP, the subset value iteration algorithm takes two steps: 
determining whether the algorithm can bring about savings in time, and then running subset 
value iteration if it can. The first step needs to compute $|\mathcal{A}||\mathcal{Z}|$ 
determinants $|P_{az}|$. Since the complexity of computing $|P_{az}|$ is $|\mathcal{S}|^3$, the 
first step has a complexity of $O(|\mathcal{A}||\mathcal{Z}||\mathcal{S}|^3)$. 
This is the polynomial part of the complexity of subset value iteration. The second step 
is much harder than the first step. It is known that finding the optimal policy for even a 
simplified finite horizon POMDP is PSPACE-complete (Papadimitriou & Tsitsiklis, 1987; 
Burago, de Rougemont, & Slissenko, 1996; Littman, Goldsmith, & Mundhenk, 1998). Recently, 
it has been proven that finding the optimal policy for an infinite-horizon POMDP 
is incomputable (Madani, Hanks, & Condon, 1999). 
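
The first step can be sketched in Python as follows. This is an illustration under assumptions, 
not the authors' code: the function names and the nested-list model layout are hypothetical, the 
entry formula for $P_{az}$ is the one given in the maze discussion of Section 4.5.1, and the 
exact condition that is checked against the determinants is the one stated in Theorem 3, which 
the sketch does not reproduce. The determinant is computed by Gaussian elimination, which is 
cubic in the number of states. 

```python
def p_az(T_a, O_a, z):
    """Matrix with entry (i, j) = P(s_j | s_i, a) * P(z | s_j, a).

    T_a[i][j] = P(s_j | s_i, a) and O_a[j][z] = P(z | s_j, a) for a fixed a.
    """
    n = len(T_a)
    return [[T_a[i][j] * O_a[j][z] for j in range(n)] for i in range(n)]


def determinant(matrix, tol=1e-12):
    """Determinant by Gaussian elimination with partial pivoting, O(n^3)."""
    a = [row[:] for row in matrix]
    n = len(a)
    det = 1.0
    for c in range(n):
        pivot = max(range(c, n), key=lambda r: abs(a[r][c]))
        if abs(a[pivot][c]) < tol:
            return 0.0  # no usable pivot: the matrix is singular
        if pivot != c:
            a[c], a[pivot] = a[pivot], a[c]
            det = -det
        det *= a[c][c]
        for r in range(c + 1, n):
            factor = a[r][c] / a[c][c]
            for k in range(c, n):
                a[r][k] -= factor * a[c][k]
    return det


# Identity transitions; observation 0 is seen with probability 0.9 in every state.
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
uniform_obs = [[0.9, 0.1], [0.9, 0.1], [0.9, 0.1]]
det_invertible = determinant(p_az(identity, uniform_obs, 0))

# Observation 0 is only possible in state 0, so only column 0 is non-zero.
mixing = [[0.5, 0.3, 0.2], [0.1, 0.6, 0.3], [0.2, 0.2, 0.6]]
revealing_obs = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
det_singular = determinant(p_az(mixing, revealing_obs, 0))
```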

We compare subset value iteration with standard value iteration. Standard DP 
updates improve values for the space $\mathcal{B}$, while subset DP updates improve values for the 
subset $\tau(\mathcal{B})$. If the initial set $\mathcal{V}_0^{\tau(\mathcal{B})}$ is equal to the 
initial set $\mathcal{V}_0$ in standard value iteration, then, because $\tau(\mathcal{B})$ is a 
subset of $\mathcal{B}$, the vectors in $\mathcal{V}_1^{\tau(\mathcal{B})}$ must be in 
$\mathcal{V}_1$, and $\mathcal{V}_1^{\tau(\mathcal{B})}$ is a subset of $\mathcal{V}_1$. 
Inductively, $\mathcal{V}_n^{\tau(\mathcal{B})}$ is a subset of $\mathcal{V}_n$ for any $n$. This 
analysis suggests two advantages of subset value iteration if the subset $\tau(\mathcal{B})$ is a 
proper subset of the belief space $\mathcal{B}$. First, fewer vectors are needed to represent a 
value function over a belief subset. This is the representational advantage in space. For 
$\mathcal{V}_{n+1}$, its size can be as large as $|\mathcal{A}||\mathcal{V}_n|^{|\mathcal{Z}|}$. 
For $\mathcal{V}_{n+1}^{\tau(\mathcal{B})}$, its size can be as large as 
$|\mathcal{A}||\mathcal{V}_n^{\tau(\mathcal{B})}|^{|\mathcal{Z}|}$. Clearly, a subset DP update 
generates fewer vectors. Second, fewer vectors mean a lower time complexity, since computing 
vectors requires solving linear programs. This is the computational advantage in time. 

However, the advantages strongly depend upon the size of the subset $\tau(\mathcal{B})$. If each 
simplex $\tau(\mathcal{B},a,z)$ is the same as the belief space $\mathcal{B}$ and DP updates are 
conducted in an individual fashion, subset DP updates could be $|\mathcal{A}||\mathcal{Z}|$ times 
slower than standard DP updates. This is the worst case complexity. Fortunately, by the 
discussions in the previous section (Theorem 3), we know that given a POMDP we are able to 
determine whether the selected subset $\tau(\mathcal{B})$ is a proper subset of the belief space 
before solving it. 

Although Theorem 3 gives a condition to determine when subset value iteration is 
more efficient than standard value iteration for a POMDP, it does not answer the question 
of how much savings the algorithm can bring about, which turns out to be a very difficult 
problem in theoretical analysis. The difficulty lies not only in the size of the set 
$\tau(\mathcal{B})$ but also in the vectors representing the step and the optimal value functions. 
Let us assume that $\tau(\mathcal{B})$ is a proper subset of the belief space. We can imagine at 
least two cases. In one case, if at each iteration the step value function has very few useful 
vectors in the subset $\tau(\mathcal{B})$, subset value iteration can be very efficient. In the 
other case, if at each iteration all the vectors of the step value function are useful in the 
subset $\tau(\mathcal{B})$, subset value iteration has the same complexity as standard value 
iteration in an asymptotic sense. In general, given a POMDP, it is difficult to predict how these 
vectors scatter around the belief subsets and the belief space. Consequently, it is hard to 
predict how much saving the subset value iteration algorithm can bring about for a POMDP without 
solving it. 

4.5 Empirical Studies 

We present our empirical results on two variants of a designed maze problem and the 
problems in a standard test-bed in this subsection. Some common settings to all experiments 
in the paper are as follows. The experiments are conducted on an UltraSparc II machine 
with dual CPUs and 256MB of RAM. Our code is written in C and executed under 
the UNIX operating system Solaris 2.6. When solving linear programs, we use a commercial 
package CPLEX V6.0. The discount factor is set at 0.95 and round-off precision is set 
at 10^^. When not stated otherwise, the quality requirement e is set to 0.01. We use 
incremental pruning to compute representing sets of value functions over the belief space 
or belief subsets. 

We compare the performances of subset and standard value iteration. For simplicity, 
we denote them respectively by ssVI and VI. At each iteration, we compare VI and ssVI 
in two measures: sizes of sets representing value functions and total time of DP updates. 


4.5.1 The Maze Problem 

The maze problem is specified in Figure 3. There are 10 locations and the goal is location 9. 
A robot agent can execute four "move" actions to change its position, optionally a "look" 
action to observe its surroundings and a "declare" action to announce its success of goal- 
attainment. The "move" actions can achieve their intended effects with probability 0.8, 
but might have no effects with probability 0.1 (the agent's position remains unchanged) or 
lead to overshooting with probability 0.1. Moving against maze walls leaves the agent at its 
original location. Other actions do not change the agent's position. At each time point, the 
robot receives a "null" observation giving no useful information at all, or reads four sensors 
so as to reason about its current position. Each sensor informs the robot whether there is a 
wall or nothing along a direction. In the figure thick lines stand for walls and thin lines for 
nothing (open). For instance, if the agent is at location 2, ideally a string "owow" (in the 
order of East, South, West and North) is received. Specific parameters will be instantiated 
in relevant empirical analysis. The robot is required to maximize the infinite discounted 
sum of rewards. 

Figure 3: A maze problem 

Two variants of the above maze are designed to test ssVI and VI. They are denoted by 
maze1 and maze2. For maze1, ssVI is more efficient; for maze2, ssVI is less efficient. 

Case I: $\tau(\mathcal{B}) \subset \mathcal{B}$ 

The maze1 problem has a state space of 10 locations, an action space of size 5 (four 
"move" actions and one "declare" action) and an observation space of size 6 (strings of four 
letters). An ideal string is received with certainty after any action is performed. When the 
agent declares the goal at location 9, it receives a reward of 1 unit; if it does so at location 
10, it receives a reward of -1. Other combinations of actions and observations lead to no reward. 

We collect the results in Figure 4. The first chart in the figure depicts the total time of 
DP updates in log-scale for VI with the loose stopping criterion and ssVI with the strict one 
(Section 4.3.2). To compute a 0.01-optimal value function, VI took 20,000 seconds after 162 
iterations while ssVI with strict stopping criterion took 900 seconds after 197 iterations. 
We note that ssVI needs more iterations but it still takes much less time. The performance 
difference is big. Moreover, more iterations means that the value function generated by 
ssVI is closer to optimality. 

This is not a surprising result if we take a look at the matrix $P_{az}$ for an action $a$ 
and observation $z$. We know that the matrix impacts the size of the simplex 
$\tau(\mathcal{B},a,z)$. The dimension of the matrix is 10 x 10. The entry of $P_{az}$ at $(i,j)$ 
is the product of the transition probability $P(s_j|s_i,a)$ and the observation probability 
$P(z|s_j,a)$. Let us assume that the observation is owow. Hence, the possible locations are 2 and 
5. Since the entry at $(i,j)$ vanishes whenever $P(z|s_j,a) = 0$, regardless of the action 
executed, only the entries in columns 2 and 5 of $P_{az}$ can be non-zero. Therefore, the matrix 
is highly sparse and non-invertible, and the simplex $\tau(\mathcal{B},a,z)$ is much smaller than 
$\mathcal{B}$. This analysis holds similarly for other combinations of actions and observations. 
Hence, ssVI accounts for only a small portion of the belief space. This explains why ssVI is more 
efficient than VI. In addition, we expect that the sets generated by ssVI are much smaller than 
those generated by VI. 

Figure 4: Comparative studies for VI and ssVI on maze1 

This is confirmed in the second chart of the figure. It depicts the sizes of the sets 
representing the value functions generated by ssVI and VI at each iteration. When counting 
the size of $\mathcal{V}_n^{\tau(\mathcal{B})}$, we sum the sizes of the representing sets over 
the $|\mathcal{A}||\mathcal{Z}|$ simplices. We note that at the same iteration VI always generates 
many more vectors than ssVI. The sizes on both curves increase sharply in the first iterations 
and then stabilize. The size for VI reaches its peak of 2466 at iteration 11 and the maximum size 
for ssVI is 139 at iteration 10. The peak size in VI is thus nearly 20 times that in ssVI 
(2466/139 is about 17.7), a magnitude consistent with the performance difference. After the sizes 
stabilize, the sizes of the sets generated by VI are around 130 and those generated by ssVI are 
around 50. 

Case II: $\tau(\mathcal{B}) = \mathcal{B}$ 

The problem maze2 is designed to show that ssVI could be less efficient than VI when 
the selected belief subset $\tau(\mathcal{B})$ is equal to the belief space $\mathcal{B}$. The 
problem has a state space of 10 locations, an action space of size 6 (four moves, one stay and 
one declare) and an observation space of size 7 (6 strings and a null telling nothing). The 
action stay does not change the agent's position. The observation model of maze2 is more 
complicated than that of maze1. Due to hardware limitations, after a move action, with a 
probability of 0.1, the agent receives a wrong report where the string owow is collected as owww 
and woww as wowo. If the declare action is executed, the agent always receives a null 
observation. In addition, if the agent executes stay, it receives either a null observation with 
probability 0.9 or the ideal string about the surrounding locations with probability 0.1. 

The reward model is accordingly changed to reflect new design considerations. We 
assume that the agent needs to pay for its information about states. For this purpose, if the 
agent executes stay, it really does nothing and thus incurs no cost (i.e., negative reward). 
In contrast, the "move" actions always incur a cost of 2. Depending on the location at 
which it executes declare, the agent receives rewards or costs: if the location is state 9, it 
receives a reward of 1; if state 10, it receives a cost of 1; otherwise, it receives no reward. 
The stay action is attractive in that it incurs no cost yet leads, with a small likelihood, to a 
useful observation about states. 

The empirical results are collected in Figure 5. First, we note that both VI and ssVI are 
able to run only 11 iterations within a reasonable time limit (8 hours). The first chart in 
the figure presents the time costs along iterations. To run 11 iterations, ssVI takes 53,000 
seconds while VI takes around 30,900 seconds. Therefore, ssVI is slower than VI for this 
problem. However, the magnitude of the performance difference is not big. To explain this, let 
us consider the matrix $P_{az}$ for action stay and observation null. The transition matrix 
is an identity and each state leads to the null observation with probability 0.9 if 
stay is executed. Therefore, the matrix $P_{az}$ is invertible and the simplex 
$\tau(\mathcal{B},a,z)$ is the same as the belief space $\mathcal{B}$. Because ssVI needs to 
account for additional simplices for other combinations of actions and observations, ssVI must be 
less efficient than VI. This explains the performance difference in time between ssVI and VI. 

Figure 5: Comparative studies for VI and ssVI on maze2 
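
As a worked instance of the invertibility argument for the pair (stay, null), assume the entry 
formula for $P_{az}$ given in the maze1 discussion and the 10 states of the maze; the Kronecker 
delta $\delta_{ij}$ is our notation, not the paper's. Since the transition matrix is the 
identity and $P(\text{null}|s_j,\text{stay}) = 0.9$ for every state: 

```latex
P_{az}(i,j) = P(s_j \mid s_i, a)\, P(z \mid s_j, a) = 0.9\,\delta_{ij},
\qquad
|P_{az}| = 0.9^{10} \approx 0.349 \neq 0 .
```

The determinant is non-zero, so the matrix is invertible, in agreement with the statement above. 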

It is also anticipated that ssVI should generate more vectors than VI at the same 
iteration because the size of $\mathcal{V}_n^{\tau(\mathcal{B})}$ is defined to be the sum of the 
sizes of the individual sets in it. This is confirmed and demonstrated in the second chart of 
Figure 5. The curve for ssVI is always above that for VI. For the 11th iteration, ssVI generates 
3,300 vectors and VI generates around 1,700 vectors. 

4.5.2 More Experiments on the Test-Bed 

To validate the performance of subset value iteration over different problem domains, 
we collected the results of the algorithm on the standard test-bed maintained by Tony 
Cassandra (see footnote 6). In the literature, the eight problems are commonly referred to as 
4x3CO, Cheese, 4x4, Part Painting, Tiger, Shuttle, Network, and Aircraft. Table 2 presents 
detailed statistics for these problems. In the table, Rows 2-4 give the sizes of problem 
parameters, namely the number of states, observations and actions. Rows 5 and 6 show the CPU 
seconds for the standard and subset value iteration algorithms to compute the 0.01-optimal policy 
for each problem. Row 7 shows the number of vectors representing the 0.01-optimal 
value function in standard value iteration. In our experiments, we implemented subset 
value iteration in the individual fashion. In Row 8, an entry takes the form •/•, denoting 
the total number of vectors over all $|\mathcal{A}| \cdot |\mathcal{Z}|$ simplices and the maximum 
number of vectors among these simplices when subset value iteration terminates. The last row shows 
whether the belief subset $\tau(\mathcal{B})$ is a proper subset of the belief space. 

6. See the URL http://pomdp.org/pomdp/examples/index.shtml 

             4x3CO    Cheese   4x4      Paint    Tiger    Shuttle   Network    Aircraft
|S|          11       11       16       4        2        8         7          12
|Z|          4        4        2        4        2        2         2          5
|A|          11       7        4        2        3        3         4          6
VI(Time)     3.52     14.06    27.47    38.75    82.40    6130.69   13283.15   1723193.34
ssVI(Time)   63.28    85.44    85.44    85.20    145.21   1437.32   2810.21    425786.49
VI(#)        4        14       20       9        9        208       491        2071
ssVI(#)      43/1     32/2     42/10    22/9     22/9     98/45     201/50     3236/428
subspace     yes      yes      yes      no       no       yes       yes        yes

Table 2: Comparative studies for ssVI and VI on the standard test-bed 

In discussing the performance of the subset value iteration algorithm, we categorize the 
tested problems into three classes. In the first class, the subset $\tau(\mathcal{B})$ is actually 
the same as the belief space. Subset value iteration must be less efficient than standard value 
iteration, for the following reason: there exists at least one belief simplex such that value 
iteration over it has the same complexity as standard value iteration; moreover, subset value 
iteration needs to account for other simplices. Example problems are tiger and paint. Let us take 
the paint problem as an instance. Our results show that there are two simplices that are the same 
as the belief space. The 0.01-optimal value function over each of them is represented by 9 
vectors, the same number of vectors that represent the 0.01-optimal value function over the 
entire belief space. In the second class of the tested problems, the set $\tau(\mathcal{B})$ is a 
proper subset of the belief space, and meanwhile the numbers of vectors representing 
value functions over the belief space and individual simplices are very small. Subset value 
iteration may not be as efficient as standard value iteration because of the overhead of 
accounting for a large number of simplices. Example problems include 4x3CO, cheese and 
4x4. Let us take 4x3CO as an instance. The 0.01-optimal value function over the entire 
belief space is represented by only 4 vectors, whereas the 0.01-optimal value function over 
each simplex is represented by only 1 vector. Since subset value iteration has to account 
for 44 simplices, it is less efficient than standard value iteration. In the 
third class of tested problems, the set $\tau(\mathcal{B})$ is a proper subset of the belief 
space, and meanwhile the numbers of vectors representing value functions over the belief space are 
moderately large. The subset value iteration algorithm is more efficient than the standard 
value iteration algorithm. Examples include shuttle, network and aircraft. Let us take 
network as an instance. The 0.01-optimal value function in the belief space is represented 
by 491 vectors, whereas the 0.01-optimal value function in $\tau(\mathcal{B})$ is represented by 
fewer than 201 vectors (note that there are duplicates across belief simplices). The maximum size 
of the representing sets over all simplices is 50. In this case, we expect that the savings 
brought by subset value iteration outweigh the overhead of accounting for the simplices. The 
result shows that subset value iteration is about 5 times faster than standard value iteration. 
Combining these results with those on the maze problem, we see that the computational 
savings brought by subset value iteration vary with different problem domains. Theorem 
3 can be used to determine whether subset value iteration can bring about computational 
savings for a POMDP. In the event that the belief set $\tau(\mathcal{B})$ is a proper subset of 
the belief space, the magnitude of the savings needs to be determined through empirical 
evaluation. 
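
The speed-ups implied by the reported CPU times can be recomputed directly. The short Python 
sketch below is only arithmetic on the VI(Time) and ssVI(Time) entries of Table 2 for the four 
problems whose column labels are explicit in the table; the variable names are ours. 

```python
# CPU seconds from Table 2 (VI(Time) and ssVI(Time) rows).
vi_seconds = {"Tiger": 82.40, "Shuttle": 6130.69, "Network": 13283.15, "Aircraft": 1723193.34}
ssvi_seconds = {"Tiger": 145.21, "Shuttle": 1437.32, "Network": 2810.21, "Aircraft": 425786.49}

# A ratio above 1 means subset value iteration was faster.
speedup = {name: vi_seconds[name] / ssvi_seconds[name] for name in vi_seconds}
faster_with_ssvi = sorted(name for name, ratio in speedup.items() if ratio > 1.0)
```

The ratio for network is roughly 4.7, which matches the "about 5 times faster" figure quoted 
above, while tiger, a first-class problem, has a ratio below 1. 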

5. Informative POMDPs 

In this section, we study a special POMDP class, namely informative POMDPs. For this 
POMDP class, there are natural belief subsets for value iteration to work with. We will 
show how to formally define these subsets. As the value iteration over these belief subsets 
has been described (Zhang & Liu, 1997), our focus is to compare the algorithm with the 
general subset value iteration developed in the previous section. 

5.1 Motivation 

As noted by some authors, in reality an agent often has a good, although imperfect, idea 
about its locations (Roy & Gordon, 2002). For instance, mobile robots and other real world 
systems have local uncertainty, but rarely encounter global uncertainty. Let us exemplify 
this using the maze in Figure 3. Suppose that at each time point an agent receives a string 
of four letters with certainty. In total, there are 6 observations, owww, owow, owoo, wwow, 
wowo and woww regardless of executed actions. If we enumerate all possible observations 
and the set of locations at which the agent receives such observations, we end up with the 
following table. 



observations   states     observations   states
owww           {1}        owow           {2,5}
owoo           {3,4}      wwow           {6}
wowo           {7,8}      woww           {9,10}



On the other hand, the strings can be used to infer the agent's locations. For instance, if a 
string owoo is received, the world must be at location 3 or 4. Hence, the observation owoo 
restricts the world into a small range of world states. In fact, any observation can restrict 
the world into at most two states although the world has ten. For this reason, the POMDP 
is said to be informative. 

In general, an agent perceives the world via observations. Starting from any state, if the 
agent executes an action $a$ and receives an observation $z$, the world states can be categorized 
into two classes by the observation model: states the agent can be in and states it cannot. 
Formally, the former is $\{s \mid s \in \mathcal{S} \text{ and } P(z|s,a) > 0\}$. The set is 
denoted by $\mathcal{S}_{az}$. We use the set to define informativeness. An $[a,z]$ pair is said 
to be informative if the size $|\mathcal{S}_{az}|$ is much smaller than $|\mathcal{S}|$. An 
observation $z$ is informative if $[a,z]$ is informative for every action $a$ giving rise to $z$. 
A POMDP is informative if all observations are informative. 

In informative POMDPs, since any observation restricts the world to a small set of 
states, the agent knows that the world cannot be in a state outside this small set. In 
other words, the agent has zero belief in the states outside the set. Consequently, an 
observation can also restrict belief states to a belief subset. 

5.2 Belief Subset Selection 

For informative POMDPs, we can select the belief subset $\tau(\mathcal{B})$ as before. Combining 
the informativeness assumption and Corollary 1, we know that $\tau(\mathcal{B})$ is a proper 
subset of the belief space. So, value iteration over $\tau(\mathcal{B})$ carries the space and 
time savings. In this section, we choose an alternative belief subset for value iteration. 
Compared against the subset $\tau(\mathcal{B})$, the subset that we choose yields several 
advantages. First, it is conceptually simple and geometrically intuitive. Second, it facilitates 
employing the low dimensional representation of vectors. Third, it may lead to additional savings 
in time if the observation models of a POMDP are independent of actions. The latter two 
advantages will be shown later. 

To define the belief subset (say $\phi(\mathcal{B})$), we first define a subset $\phi(\mathcal{B},a,z)$ for an action and
observation pair. Then, the belief subset $\phi(\mathcal{B})$ is formed by taking the union of $\phi(\mathcal{B},a,z)$
over all action and observation pairs. To be specific,

$$\phi(\mathcal{B},a,z) = \{\, b \mid \sum_{s \in S_{az}} b(s) = 1.0,\ \forall s \in S_{az},\ b(s) \geq 0 \,\} \qquad (12)$$

and

$$\phi(\mathcal{B}) = \cup_{a,z}\, \phi(\mathcal{B},a,z).$$

It is trivial to see that $\phi(\mathcal{B},a,z)$ is a belief simplex. It can be proven that for any belief state
$b$, $\tau(b,a,z)$ must be in $\phi(\mathcal{B},a,z)$. Therefore, $\tau(\mathcal{B},a,z)$ is a subset of $\phi(\mathcal{B},a,z)$. Consequently,
$\tau(\mathcal{B})$ is a subset of $\phi(\mathcal{B})$. This is summarized in the lemma below. The lemma is useful
when we discuss the value iteration algorithm working with the belief subset $\phi(\mathcal{B})$.

Lemma 5 For a POMDP, $\tau(\mathcal{B}) \subseteq \phi(\mathcal{B})$.
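
The containment behind Lemma 5 can be checked numerically on a toy model. The sketch below assumes the usual Bayes belief update for $\tau(b,a,z)$ and the layouts `trans[a][s][s2]` $= P(s_2|s,a)$ and `obs[a][s2][z]` $= P(z|s_2,a)$; the function names and the toy numbers are illustrative and not taken from the paper.

```python
def belief_update(b, a, z, trans, obs):
    """tau(b, a, z): Bayes update of belief b after action a and observation z.

    trans[a][s][s2] = P(s2 | s, a); obs[a][s2][z] = P(z | s2, a).
    Returns None when z has zero probability under b and a.
    """
    n = len(b)
    unnorm = [obs[a][s2][z] * sum(trans[a][s][s2] * b[s] for s in range(n))
              for s2 in range(n)]
    total = sum(unnorm)
    return None if total == 0.0 else [p / total for p in unnorm]


def in_phi_simplex(b, a, z, obs, tol=1e-9):
    """Membership test for phi(B, a, z): all probability mass lies on S_az."""
    mass_in = sum(p for s, p in enumerate(b) if obs[a][s][z] > 0.0)
    mass_out = sum(b) - mass_in
    return (all(p >= -tol for p in b)
            and abs(mass_in - 1.0) <= tol
            and abs(mass_out) <= tol)


# Toy model: 3 states, 1 action, 2 observations.
trans = [[[0.5, 0.5, 0.0], [0.0, 0.5, 0.5], [0.5, 0.0, 0.5]]]
obs = [[[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]]
b_next = belief_update([0.2, 0.3, 0.5], 0, 0, trans, obs)
```

Here observation 0 is impossible in state 2, so the updated belief places no mass there and lies in the corresponding $\phi$-simplex.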

It is of interest to compare the $\tau$-simplex and the $\phi$-simplex for a pair of $a$ and $z$. Although
both simplices are generated by a list of belief states, the $\phi$-simplex has a more intuitive geometric
meaning. Each belief state in the basis of $\phi(\mathcal{B},a,z)$ is a unit vector, i.e., it has probability
mass on one state. Therefore, the belief state in the basis must be a boundary point of the
belief space. In contrast, a belief state in the basis of a $\tau$-simplex can be an interior point.
See Figure 1 for an example, where $A_2$, $A_3$ are interior points and $A_1$ is a boundary point
of the belief space.

5.3 Value Iteration over $\phi(\mathcal{B})$

From the theoretical perspective, the feasibility of conducting value iteration in $\phi(\mathcal{B})$ is
justified by Lemma 1. Combined with Lemma 5, the subset $\phi(\mathcal{B})$ is a closed set. Hence, the
MDP theory is applicable to defining the DP update equation. By our discussions on the
relationship and value iteration in Section 4.3, value iteration working with $\phi(\mathcal{B})$ retains
the quality of value functions.

We further exploit the informative feature in value iteration over $\phi(\mathcal{B})$. We briefly outline
the subset value iteration algorithm and refer the readers to a detailed description (Zhang
& Liu, 1997). The basic idea is to reduce the dimensions of vectors in representing sets of
value functions. Note that for any pair $[a,z]$, since the beliefs in states outside $S_{az}$ are zero,
a vector in the representing set of a value function over the simplex $\phi(\mathcal{B},a,z)$ needs only
$|S_{az}|$ components. In an individual fashion, a DP update over $\phi(\mathcal{B})$ computes a collection
$\{\mathcal{V}_{n+1}^{\phi(\mathcal{B},a,z)}\}$ from a collection $\{\mathcal{V}_n^{\phi(\mathcal{B},a,z)}\}$, where $\mathcal{V}_n^{\phi(\mathcal{B},a,z)}$ represents the $n$th-step value function and
the vectors in it have $|S_{az}|$ dimensions. The procedure of conducting a DP update is parallel
to that in Section 4.2 except that $\tau(\mathcal{B},a,z)$ is replaced by $\phi(\mathcal{B},a,z)$. In the enumeration
step, when building a vector $\beta$ in a belief simplex $\phi(\mathcal{B},a',z')$ using Equation (10), we need
only define its components corresponding to the set $S_{a'z'}$. In the reduction step, for each
constructed set $\mathcal{V}_{n+1}^{\phi(\mathcal{B},a,z)}$, a pruning procedure is called to remove useless vectors to obtain
the minimal representation of the set. Note that the lower dimension is also used
to cut down the number of variables in setting up linear programs.
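
The dimension reduction can be illustrated with a small sketch. It assumes a value function is represented by a list of vectors and that a belief in $\phi(\mathcal{B},a,z)$ is zero outside $S_{az}$; the helper names and the numbers are hypothetical.

```python
def project_vector(alpha, s_az):
    """Keep only the components of a full |S|-vector indexed by S_az (sorted)."""
    return [alpha[s] for s in sorted(s_az)]


def value_low_dim(vectors_low, b, s_az):
    """V(b) = max over vectors of sum_{s in S_az} alpha(s) * b(s)."""
    idx = sorted(s_az)
    return max(sum(v[k] * b[s] for k, s in enumerate(idx)) for v in vectors_low)


s_az = {1, 3}
full = [[4.0, 1.0, 7.0, 2.0], [0.0, 3.0, 9.0, 0.5]]
low = [project_vector(v, s_az) for v in full]   # |S_az| = 2 components each
b = [0.0, 0.25, 0.0, 0.75]                      # a belief in phi(B, a, z)
v_low = value_low_dim(low, b, s_az)
v_full = max(sum(x * p for x, p in zip(v, b)) for v in full)
```

For beliefs in the simplex, the projected vectors give the same value as the full vectors, which is why the discarded components never need to be stored.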

Interestingly, DP updates over $\phi(\mathcal{B})$ account for a larger subset than those over $\tau(\mathcal{B})$.
Nevertheless, since DP updates over $\phi(\mathcal{B})$ explicitly exploit the economy of representation,
they can be more efficient. In addition, DP updates over $\phi(\mathcal{B})$ have another advantage
when the observation models of a POMDP are independent of actions, i.e., when the
probabilities $P(z|s,a)$ are independent of $a$. In that case, given an observation $z$, the simplices
$\phi(\mathcal{B},a,z)$ are the same for all actions. Therefore, DP updates over $\phi(\mathcal{B})$ only account
for $|Z|$ $\phi$-simplices. However, DP updates over $\tau(\mathcal{B})$ usually need to account for $|A||Z|$
$\tau$-simplices because an observation determines different $\tau$-simplices when combined with
different actions.
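
A short sketch of the counting argument, assuming the layout `obs[a][s][z]` $= P(z|s,a)$ (an illustrative convention): since a $\phi$-simplex is determined by its support set $S_{az}$, counting distinct support sets counts distinct $\phi$-simplices.

```python
def count_phi_simplices(obs):
    """Count distinct support sets S_az over all [a, z] pairs that can occur."""
    supports = set()
    for a in range(len(obs)):
        for z in range(len(obs[a][0])):
            s_az = frozenset(s for s, row in enumerate(obs[a]) if row[z] > 0.0)
            if s_az:
                supports.add(s_az)
    return len(supports)


# Two actions sharing one observation model: 3 states, 2 observations.
shared = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
obs = [shared, shared]
n_simplices = count_phi_simplices(obs)
```

With two actions and two observations there are four $[a,z]$ pairs, but only two distinct $\phi$-simplices in this toy model.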

5.4 Empirical Studies 

We have conducted experiments to compare VI, ssVI and infoVI, the last of which refers to value
iteration exploiting the low-dimension feature. The experiments on mazel (defined in Section
4.5) can be found elsewhere (Zhang & Zhang, 2001; Zhang, 2001). The results, together
with existing results (Zhang & Liu, 1997), showed that value iteration over $\phi(\mathcal{B})$ can be
significantly more efficient than standard value iteration. For reference, we mention that
it is feasible to integrate a point-based technique with value iteration over $\phi(\mathcal{B})$ in order to
both reduce the number of iterations and accelerate the individual iterations
(Zhang & Zhang, 2001b). To demonstrate this, we include results on a 96-state POMDP
in Appendix B.

5.5 Restricted Value Iteration and Dimension Reduction 

We compare the value iteration algorithms in this and the previous section. Through the 
comparison, we would like to emphasize that working with belief subsets does not imply 
working with low-dimensional vectors. 

Although both algorithms work with belief subsets, the mechanisms exploited to achieve
the computational gains are different. The general value iteration works with the belief
subset $\tau(\mathcal{B})$ but the dimension of representing vectors is the same as the number of states,
whereas value iteration over $\phi(\mathcal{B})$ works with a superset of $\tau(\mathcal{B})$ but the dimension of the
vectors is smaller than the number of states. To demonstrate how a reduced
belief set and the low-dimensional representation respectively contribute to the computational
gains, we experimented with a carefully designed maze problem that is amenable to
both algorithms. However, it is worth pointing out that working with a reduced belief set
does not mean that the vectors can be represented in low dimensions. We illustrate this
point by continuing our discussions of the example in Section 3.3. The example shows
that the primary advantage of value iteration over $\tau(\mathcal{B})$ stems from the size of the chosen
belief subset rather than dimension reduction of the representing vectors.

Example (Continued) For the POMDP example presented in Section 3.2, the subset $\tau(\mathcal{B})$
consists of only two line segments in the entire belief space. Clearly value iteration over
$\tau(\mathcal{B})$ is more efficient than standard value iteration. However, if one runs value iteration
over $\phi(\mathcal{B})$ on this POMDP anyhow, the algorithm is less efficient than the standard value
iteration algorithm. This follows because (1) the set $S_{az_j}$ is equal to the set of states $S$ for any
action $a$ and observation $z_j$ by the second assumption, and (2) each $\phi$-simplex is actually
the same as the belief space by the definition in Equation (12). To solve this POMDP,
the subset value iteration algorithm is definitely a better choice than the value iteration
algorithm for informative POMDPs. □

6. Near-Discernible POMDPs 

In this section, we study near-discernible POMDPs. For this POMDP class, we develop an 
anytime value iteration algorithm working with growing belief subsets. 

6.1 Motivation 

A discernible POMDP assumes that once in a while the uncertainty about world states
vanishes if a particular action is executed and the observations pertaining to the action fully
reveal the identity of the world state (Hansen, 1998). Our research on near-discernible POMDPs
was motivated by two considerations. The first arises from the use of POMDPs as
a framework for planning under uncertainty. To reach a goal location, an agent has to
not only change its position by performing goal-achieving actions but also reason about its
surroundings by performing information-gathering actions. However, at any one time point the
agent cannot simultaneously change its position and observe its environment. For instance,
if an information-gathering action is performed, the agent cannot move at the same time.
The second consideration arises from existing
research in the community. Near-discernible POMDPs generalize discernible POMDPs in
that even when an information-gathering action is performed, the agent gets a rough,
rather than exact, idea about world states, and uncertainty vanishes only in some sense.

We revise the maze problem to fix ideas on the first motivation. The action space consists 
of six actions: four "moving" actions, look and declare. If move actions or declare are 
performed, an observation null is received and the agent gets no information at all. If 
look is performed, an ideal string is received and the agent gets imperfect information since 
different locations might yield the same string. On one hand, to achieve the goal location, 
the agent has to change its positions. On the other hand, to declare goal attainment with 
confidence, it has to perform look and reason about the environment. Arbitrarily declaring 
goal attainment leads to a penalty. Consequently, at a time point the agent faces the 
problem of choosing a move or look.

We note that the subset value iteration algorithm usually yields no computational
advantage for near-discernible POMDPs. We give an example in which the subset $\tau(\mathcal{B})$ is the
same as the belief space $\mathcal{B}$ under some assumptions. Suppose the maze is a square grid.
Locations are numbered such that in each row the indices of the locations increase from
left to right. We assume that a move action achieves its intended effect with a high likelihood,
but may have no effect (i.e., the agent's location remains unchanged) or may lead to
overshooting with a small probability. Under these assumptions, the transition matrix for
action east is upper-triangular and invertible. If at each location null is received with a
positive probability after a move, the transformation matrix associated with the pair [east, null]
is invertible. By Theorem 3, the belief subset $\tau(\mathcal{B}, east, null)$ is equal to the belief space $\mathcal{B}$.

Our solution to near-discernible POMDPs rests on the intuition that the agent needs
to interleave goal-achieving actions and information-gathering actions. A typical sequence
of executed actions should consist of several goal-achieving actions and an
information-gathering action. The difficulty is deciding how frequently the agent should execute an
information-gathering action. In this section, we incrementally consider action and observation
sequences containing more goal-achieving actions. We show that such sequences can be
used to determine belief simplices. As more sequences are added, the union of the belief
simplices grows. In the following, we give some technical preparations and then describe the
algorithm designed for near-discernible POMDPs. In order to put our discussions in
a general context, we shall use information-rich and information-poor actions instead of
information-gathering and goal-achieving actions respectively.

6.2 Histories, Belief Subsets and Value Functions 

A history is a sequence of ordered pairs of actions and observations. We usually denote a
history by $h$. The number of pairs of actions and observations is referred to as the length of
a history. A history of length $l$ is denoted by $[a_1, z_1, \cdots, a_l, z_l]$. If an agent's initial belief
state is $b$ and a history $h$ of length $l$ is realized, its belief state can be updated at each
time step. The notation $\tau(b,h)$ denotes the belief at time point $l$. The set $\tau(\mathcal{B},h)$ is
defined to be $\cup_{b \in \mathcal{B}} \tau(b,h)$, consisting of all possible belief states that the agent can be in at
step $l$ if it starts with any belief and history $h$ is realized. Note that if $h$ is of length 1 (say
$h = [a,z]$), $\tau(\mathcal{B},h)$ degenerates to our previous notation $\tau(\mathcal{B},a,z)$.
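
A minimal sketch of $\tau(b,h)$ as a left-to-right fold of single-step belief updates. The Bayes update and the layouts `trans[a][s][s2]` $= P(s_2|s,a)$ and `obs[a][s2][z]` $= P(z|s_2,a)$ are assumptions made for illustration, as are the toy numbers.

```python
def belief_update(b, a, z, trans, obs):
    """Single-step update tau(b, a, z); returns None if z is impossible."""
    n = len(b)
    unnorm = [obs[a][s2][z] * sum(trans[a][s][s2] * b[s] for s in range(n))
              for s2 in range(n)]
    total = sum(unnorm)
    return None if total == 0.0 else [p / total for p in unnorm]


def belief_after_history(b, history, trans, obs):
    """tau(b, h): apply tau(., a, z) for each [a, z] pair of h in order."""
    for a, z in history:
        b = belief_update(b, a, z, trans, obs)
        if b is None:          # history h cannot be realized from this belief
            return None
    return b


# Toy model: 3 states, 1 action, 2 observations.
trans = [[[0.5, 0.5, 0.0], [0.0, 0.5, 0.5], [0.5, 0.0, 0.5]]]
obs = [[[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]]
b0 = [0.2, 0.3, 0.5]
b2 = belief_after_history(b0, [(0, 0), (0, 1)], trans, obs)
```

For a history of length 1 the fold reduces to the single-step update, matching the remark that $\tau(\mathcal{B},h)$ degenerates to $\tau(\mathcal{B},a,z)$.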

Lemma 6 For any history $h$, the belief subset $\tau(\mathcal{B},h)$ is a simplex.

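A set of histories is usually denoted by $\mathcal{H}$. The belief subset $\tau(\mathcal{B},\mathcal{H})$ denotes the union
of simplices for all histories in the set $\mathcal{H}$, i.e., $\cup_{h \in \mathcal{H}} \tau(\mathcal{B},h)$. Value functions over the simplex
$\tau(\mathcal{B},h)$ and belief subset $\tau(\mathcal{B},\mathcal{H})$ are referred to as $V^{\tau(\mathcal{B},h)}$ and $V^{\tau(\mathcal{B},\mathcal{H})}$ respectively. Given
a set $\mathcal{V}^{\tau(\mathcal{B},h)}$ representing value function $V^{\tau(\mathcal{B},h)}$, the procedure simplexPrune($\mathcal{V}^{\tau(\mathcal{B},h)}$, $B_{\tau(\mathcal{B},h)}$)
in Table 1 computes the minimal representation of $\mathcal{V}^{\tau(\mathcal{B},h)}$. In the context of histories, the
occurrences of the basis $B_{\tau(\mathcal{B},a,z)}$ should be replaced by $B_{\tau(\mathcal{B},h)}$.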

6.3 Space Progressive Value Iteration 

We describe the space progressive value iteration (SPVI) algorithm. As an anytime
algorithm, SPVI begins with a belief subset and gradually grows it. When a certain stopping
criterion is met, SPVI terminates and returns a set of vectors for the agent's decision making.

6.3.1 Algorithmic structure 

SPVI interleaves value iteration (computing a value function for a belief subset) and subset
expansion (expanding the current belief subset to a larger one). The belief subsets in SPVI
are introduced by sets of histories. Subset expansion is achieved through incorporating more
histories. For convenience, the set of histories determining the $i$-th belief subset is denoted
by $\mathcal{H}_i$. The belief subset determined by $\mathcal{H}_i$ is $\tau(\mathcal{B},\mathcal{H}_i)$. The value function constructed by
SPVI for $\mathcal{H}_i$ is $V^{\tau(\mathcal{B},\mathcal{H}_i)}$.

The pseudo-code in Table 3 implements SPVI. A set of histories $\mathcal{H}_0$ (and therefore the
belief subset $\tau(\mathcal{B},\mathcal{H}_0)$), a value function $V^{\tau(\mathcal{B},\mathcal{H}_0)}$ and the quality precision $\eta$ are initialized
at line 1. This step can be regarded as the 0th-step expansion of the belief subset. Note that we
set the initial value function to be the minimum reward over all pairs of actions and states.
(This is for the convergence issue discussed later.) Value iteration over the current subset
$\tau(\mathcal{B},\mathcal{H}_i)$ is conducted at line 3, and the belief subset is expanded to the subset $\tau(\mathcal{B},\mathcal{H}_{i+1})$
through constructing a superset $\mathcal{H}_{i+1}$ of the current set $\mathcal{H}_i$ at line 4. Value function $V^{\tau(\mathcal{B},\mathcal{H}_i)}$
for the current belief subset is set to be the initial value function for the next subset at
line 5. If the stopping condition is not satisfied at line 7, SPVI goes to the next iteration;
otherwise, it terminates and returns the latest value function $V^{\tau(\mathcal{B},\mathcal{H}_{i-1})}$.
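
The control flow just described can be sketched as follows. This is only a skeleton under stated assumptions: the three callables `subset_vi`, `expand_subset` and `should_stop` are hypothetical interfaces supplied by the caller, and the stub lambdas in the usage lines exist solely to exercise the loop. Line numbers in the comments refer to the steps named in the text.

```python
def spvi(h0, v0, eta, subset_vi, expand_subset, should_stop):
    """Skeleton of the SPVI loop; the three callables are supplied by the caller.

    subset_vi(V, H, eta)  -> improved vector set over tau(B, H)
    expand_subset(V, H)   -> superset H' of H
    should_stop(V, H, i)  -> True when the stopping condition holds
    """
    histories, vectors = h0, v0
    i = 0
    while True:
        vectors = subset_vi(vectors, histories, eta)      # line 3
        expanded = expand_subset(vectors, histories)      # line 4
        # line 5: the current vectors seed value iteration on the next subset
        i += 1                                            # line 6
        if should_stop(vectors, histories, i):            # line 7
            return vectors, histories                     # latest value function
        histories = expanded


# Stub callables, only to exercise the control flow.
result_v, result_h = spvi(
    h0={("look", "z0")}, v0=[0.0], eta=0.01,
    subset_vi=lambda V, H, eta: V + [float(len(H))],
    expand_subset=lambda V, H: H | {("look", "z0", "east", "null", len(H))},
    should_stop=lambda V, H, i: i >= 2,
)
```

The function returns the vectors together with the history set they were computed for, which corresponds to returning the value function of the last subset on which value iteration was run.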

To ensure the efficiency of SPVI, its initial belief subset should be chosen to be small. To
this end, we set $\mathcal{H}_0$ to be $\{[a,z] \mid a \in A_{IR}, z \in Z_{IR}\}$ where $A_{IR}$ is the set of information-rich
actions and $Z_{IR}$ is the set of observations led to by those actions. The subset $\tau(\mathcal{B},\mathcal{H}_0)$ is
small due to the discernibility property.

In the sequel, we discuss value iteration in a belief subset, subset expansion and the 
stopping criterion in detail. 

6.3.2 Value iteration in a belief subset 

Given a set $\mathcal{V}$ of vectors, a set $\mathcal{H}$ of histories and a precision threshold $\eta$, value iteration
computes an improved value function over the belief subset $\tau(\mathcal{B},\mathcal{H})$. This is accomplished
by conducting a sequence of DP updates. In the following, we discuss implicit DP updates,
the convergence issue and the stopping criterion in the value iteration step.

An implicit DP update computes a new value function from the current one for belief
subset $\tau(\mathcal{B},\mathcal{H})$. Let $\mathcal{U}_j$ ($\mathcal{U}_0 = \mathcal{V}$) denote the $j$th-step value function. Thus, a DP update
computes value function $\mathcal{U}_{j+1}$ from $\mathcal{U}_j$. The procedure of computing $\mathcal{U}_{j+1}$ from $\mathcal{U}_j$ is parallel
to the collective DP update in Section 4. In particular, when defining a vector $\beta_{a,\delta}$ given
an action $a$ and a mapping $\delta$ in Equation (9), the occurrences of the current vector set are replaced by $\mathcal{U}_j$.
By enumerating actions and mappings, all defined vectors form the set $\mathcal{U}_{j+1}$. Its minimal
representation is obtained by removing useless vectors w.r.t. the subset $\tau(\mathcal{B},\mathcal{H})$.

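The convergence issue arises because the subset $\tau(\mathcal{B},\mathcal{H})$ may not be a closed set. To
guarantee the convergence of value iteration, we set $\mathcal{U}_{j+1}$ to be the union of set $\mathcal{U}_{j+1}$
and $\mathcal{U}_j$ after a DP update. Together with the fact that the initial value function is set
to be the minimum reward for all actions and states, the sequence $\{\mathcal{U}_j\}$ monotonically
increases in terms of induced value functions. On the other hand, the value functions in the
sequence are upper bounded by the optimal value function. Consequently, value iteration
in $\tau(\mathcal{B},\mathcal{H})$ must converge. As a result, the Bellman residual between value functions,
$\max_{b \in \tau(\mathcal{B},\mathcal{H})} |\mathcal{U}_{j+1}(b) - \mathcal{U}_j(b)|$, becomes smaller in $\tau(\mathcal{B},\mathcal{H})$ as value iteration continues. When
the residual falls below the threshold $\eta$, value iteration terminates.

SPVI:

1. $i \leftarrow 0$, initialize $\mathcal{H}_0$, $\mathcal{V}^{\tau(\mathcal{B},\mathcal{H}_0)} \leftarrow \min_{s \in S, a \in A} r(s,a)$, $\eta \leftarrow \epsilon(1-\lambda)/2\lambda$
2. Do
3.     $\mathcal{V}^{\tau(\mathcal{B},\mathcal{H}_i)} \leftarrow$ subsetVI($\mathcal{V}^{\tau(\mathcal{B},\mathcal{H}_i)}$, $\mathcal{H}_i$, $\eta$)
4.     $\langle \mathcal{H}_{i+1}, \tau(\mathcal{B},\mathcal{H}_{i+1}) \rangle \leftarrow$ expandSubset($\mathcal{V}^{\tau(\mathcal{B},\mathcal{H}_i)}$, $\mathcal{H}_i$)
5.     $\mathcal{V}^{\tau(\mathcal{B},\mathcal{H}_{i+1})} \leftarrow \mathcal{V}^{\tau(\mathcal{B},\mathcal{H}_i)}$
6.     $i \leftarrow i + 1$
7. Until (stopping condition is met)
8. Return $\mathcal{V}^{\tau(\mathcal{B},\mathcal{H}_{i-1})}$

subsetVI($\mathcal{V}$, $\mathcal{H}$, $\eta$):

1. $j \leftarrow 0$, $\mathcal{U}_0 \leftarrow \mathcal{V}$
2. Do
3.     $\mathcal{U}_{j+1} \leftarrow$ subsetDPUpdate($\mathcal{U}_j$, $\tau(\mathcal{B},\mathcal{H})$)
4.     $\mathcal{U}_{j+1} \leftarrow \mathcal{U}_{j+1} \cup \mathcal{U}_j$
5.     $j \leftarrow j + 1$
6. While ( $\max_{b \in \tau(\mathcal{B},\mathcal{H})} |\mathcal{U}_j(b) - \mathcal{U}_{j-1}(b)| \geq \eta$ )
7. Return $\mathcal{U}_{j-1}$

expandSubset($\mathcal{V}$, $\mathcal{H}$):

1. $\mathcal{H}' \leftarrow \mathcal{H}$
2. For each $\beta$ in the set $\mathcal{V}$
3.     If $\beta$.history is maximal in $\mathcal{H}$ and $\beta$.action is information-poor
4.         For each $[a,z]$ in $A_{IP} \times Z_{IP}$
5.             $\mathcal{H}' \leftarrow \mathcal{H}' \cup \{[h,a,z]\}$
6. Return $\langle \mathcal{H}', \tau(\mathcal{B},\mathcal{H}') \rangle$

Table 3: Space progressive value iteration (SPVI)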

The value iteration step is implemented as the procedure subsetVI in Table 3. Given a
set $\mathcal{V}$ of vectors, a set $\mathcal{H}$ of histories and a threshold $\eta$, the procedure computes an improved
value function for belief subset $\tau(\mathcal{B},\mathcal{H})$. Value function $\mathcal{U}_0$ is set to be the input set $\mathcal{V}$ at
line 1. The new value function $\mathcal{U}_{j+1}$ is computed by a DP update at line 3. To guarantee
convergence, $\mathcal{U}_{j+1}$ is set to be the union of $\mathcal{U}_j$ and $\mathcal{U}_{j+1}$ at line 4. The stopping criterion is
tested at line 6. If it is met, the latest value function $\mathcal{U}_{j-1}$ is returned.
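
A hedged sketch of the value iteration step follows. It is not the procedure of Table 3 itself: the DP update is passed in as a hypothetical callable `dp_update`, no pruning is performed, and the Bellman residual is estimated on a finite sample of beliefs rather than over the whole subset. The toy update used in the usage lines is a simple contraction chosen only so that the loop converges.

```python
def value(vectors, b):
    """Value of belief b under a vector set: max over vectors of alpha . b."""
    return max(sum(x * p for x, p in zip(alpha, b)) for alpha in vectors)


def subset_vi(v, sample_beliefs, eta, dp_update, max_iters=1000):
    """Iterate DP updates until the sampled residual drops below eta.

    dp_update(U) -> list of vectors (hypothetical stand-in for subsetDPUpdate).
    sample_beliefs: finite sample of the belief subset used for the residual.
    """
    u_prev = list(v)
    for _ in range(max_iters):
        u_next = dp_update(u_prev)
        # Union with the previous set keeps the induced values monotone.
        u_next = u_next + [a for a in u_prev if a not in u_next]
        residual = max(abs(value(u_next, b) - value(u_prev, b))
                       for b in sample_beliefs)
        if residual < eta:
            return u_prev          # the text returns the previous vector set
        u_prev = u_next
    return u_prev


# Toy contraction: every vector moves halfway toward [2, 2].
toy_update = lambda U: [[1.0 + 0.5 * x for x in a] for a in U]
beliefs = [[0.5, 0.5], [1.0, 0.0]]
result = subset_vi([[0.0, 0.0]], beliefs, 0.01, toy_update)
```

Because old vectors are kept, the value at any belief never decreases from one iteration to the next, which is the monotonicity used in the convergence argument.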

6.3.3 Subset expansion 

Given a set $\mathcal{V}$ of vectors and a set $\mathcal{H}$ of histories, the subset expansion step expands the belief
subset $\tau(\mathcal{B},\mathcal{H})$ to a larger one. This is achieved by generating a superset $\mathcal{H}'$ of $\mathcal{H}$. The new
belief subset $\tau(\mathcal{B},\mathcal{H}')$ is thus a superset of $\tau(\mathcal{B},\mathcal{H})$. Hence, the key in subset expansion is
how to generate a history set $\mathcal{H}'$. In the following, we propose two approaches to generating
the history set using our intuition for near-discernible POMDPs. Both approaches generate
new histories by exploiting the vectors in $\mathcal{V}$. We begin with an analysis of the vectors in
the set $\mathcal{V}$ and show how to use them to generate histories.

Let $\beta$ be a vector in the set $\mathcal{V}$. Remember that $\beta$ is defined by a pair of action $a$ and
mapping $\delta$. For convenience, such an action is said to be the associated action of $\beta$. In
addition, if $\beta$ is useful in the set $\mathcal{V}$ w.r.t. the belief subset $\tau(\mathcal{B},\mathcal{H})$, there must exist a history
$h$ in $\mathcal{H}$ such that $\beta$ is useful w.r.t. the belief simplex $\tau(\mathcal{B},h)$. The history $h$ is said to be
the associated history of $\beta$. The associated history of a vector can be used to generate new
histories by extending the history, i.e., appending pairs of information-poor actions and
observations to the history. Let $A_{IP}$ be the set of the information-poor actions and $Z_{IP}$
be the set of the observations led to by those actions. Extending history $h$ results in a set
$\{[h,a,z] \mid a \in A_{IP}, z \in Z_{IP}\}$. The set contains $|A_{IP}||Z_{IP}|$ histories. Each history in the set
is called an extension of history $h$.
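
A minimal sketch of history extension, assuming a history is stored as a tuple of (action, observation) pairs; the function name and the action and observation labels are illustrative.

```python
from itertools import product


def extensions(h, actions_ip, observations_ip):
    """All histories [h, a, z] with a in A_IP and z in Z_IP.

    A history is a tuple of (action, observation) pairs.
    """
    return [h + ((a, z),) for a, z in product(actions_ip, observations_ip)]


h = (("look", "owoo"),)
ext = extensions(h, ["north", "south", "east", "west"], ["null"])
```

With four information-poor actions and a single null observation, the history has four extensions.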

To generate $\mathcal{H}'$ from the set $\mathcal{H}$ and the set $\mathcal{V}$, one generic approach works as follows.
Each vector $\beta$ in the set $\mathcal{V}$ is examined. If its associated history is long and the associated
action is information-poor, we produce all extensions of its associated history. An extension
is added to $\mathcal{H}'$ if it is not already in $\mathcal{H}'$. (The reason is that the associated action of a vector should
be information-rich if its associated history is sufficiently long.) To ensure that $\mathcal{H}'$ is a
superset of $\mathcal{H}$, we set $\mathcal{H}'$ to be $\mathcal{H}$ in the beginning. Apparently, this approach to generating
histories suffers from exponential growth of the history sets. In the
worst case where all the vectors in $\mathcal{V}$ are associated with information-poor actions, the size
$|\mathcal{H}'|$ is $|\mathcal{H}||A_{IP}||Z_{IP}|$. Consequently, after the $i$-step subset expansion, $|\mathcal{H}_{i+1}|$ can be as
large as $|A_{IR}||Z_{IR}|(|A_{IP}||Z_{IP}|)^i$ where $|A_{IR}||Z_{IR}|$ is the size of the initial history set.

To alleviate this problem, we use a heuristic approach to generating $\mathcal{H}'$ in the hope that
the size of $\mathcal{H}'$ increases moderately. The above exhaustive approach extends the histories
associated with the vectors prescribing information-poor actions. The heuristic approach
does not extend all such histories. Instead it extends only maximal histories in the set
$\mathcal{H}$. (A history is said to be maximal in a set if none of its extensions is in the set.) This
is the only change made from the above approach. As indicated in the experiments, this
restriction can effectively cut down the size of the history sets. Nonetheless, the heuristic
approach shares the same worst-case complexity with the exhaustive approach.

The subset expansion step is implemented as the procedure expandSubset in Table
3. Given a set $\mathcal{V}$ of vectors and a set $\mathcal{H}$ of histories, it computes an expanded set $\mathcal{H}'$ and an
expanded belief subset $\tau(\mathcal{B},\mathcal{H}')$. The set $\mathcal{H}'$ is initialized to be $\mathcal{H}$ at line 1. For each vector
$\beta$ in $\mathcal{V}$ at line 2, if its associated history is maximal in $\mathcal{H}$ and its action is information-poor
(line 3), all the extensions of its associated history are added to $\mathcal{H}'$ (line 5). The expanded
set $\mathcal{H}'$ and also the expanded belief subset $\tau(\mathcal{B},\mathcal{H}')$ are returned at line 6.
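
The heuristic expansion can be sketched as below. The representation is an assumption made for illustration: each vector is reduced to a (history, action) tag, histories are tuples of (action, observation) pairs, and only the expanded history set is returned (the belief subset it induces is left implicit).

```python
from itertools import product


def is_maximal(h, histories):
    """A history is maximal in a set if no member extends it by one pair."""
    return not any(len(g) == len(h) + 1 and g[:len(h)] == h for g in histories)


def expand_subset(vectors, histories, actions_ip, observations_ip):
    """vectors: iterable of (history, action) tags, one per vector in V."""
    expanded = set(histories)                                   # line 1
    for h, action in vectors:                                   # line 2
        if is_maximal(h, histories) and action in actions_ip:   # line 3
            for a, z in product(actions_ip, observations_ip):   # line 4
                expanded.add(h + ((a, z),))                     # line 5
    return expanded                                             # line 6


h0 = (("look", "z1"),)
h1 = (("look", "z2"),)
h1_ext = h1 + (("east", "null"),)
histories = {h0, h1, h1_ext}
vectors = [(h0, "east"), (h1, "east"), (h1_ext, "look")]
new_histories = expand_subset(vectors, histories, ["east", "west"], ["null"])
```

Only `h0` is extended here: `h1` is not maximal because one of its extensions is already in the set, and the vector attached to `h1_ext` prescribes an information-rich action.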

6.3.4 Stopping criterion and Decision-Making 

As an anytime algorithm, SPVI can be terminated if a hard deadline is reached. Another
stopping criterion of interest can be set as follows. Given a sufficiently large amount of
time, SPVI would account for as many histories as possible. If a (near) optimal policy of the
POMDP requires that information-rich actions be executed after a sequence of
information-poor actions, SPVI should be able to compute a value function for the belief subset that
consists of all possible belief states that the agent encounters if it is guided by such a
policy. After sufficiently many expansions of history sets and hence belief subsets, any
vector associated with a maximal history prescribes an information-rich action. If all the
vectors in the representing set prescribe information-rich actions, SPVI terminates. If a
(near) optimal policy has the desired structure in its sequence of actions, the output value
function should be near optimal in the final belief subset.

When SPVI terminates, the returned value function $V^{\tau(\mathcal{B},\mathcal{H}_{i-1})}$ can be used for decision making.
Similarly to Equation (3), a $V^{\tau(\mathcal{B},\mathcal{H}_{i-1})}$-improving policy can be defined over the belief space.
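
Equation (3) is not reproduced in this section, so the following sketch assumes the usual one-step lookahead form of a value-improving policy: choose the action that maximizes the expected immediate reward plus the discounted expected value of the updated belief. The layouts `trans[a][s][s2]`, `obs[a][s2][z]` and `reward[a][s]`, and all names, are illustrative assumptions.

```python
def value(vectors, b):
    """Value of belief b under a vector set: max over vectors of alpha . b."""
    return max(sum(x * p for x, p in zip(alpha, b)) for alpha in vectors)


def improving_action(b, vectors, trans, obs, reward, discount):
    """One-step lookahead action choice with respect to a vector set."""
    n = len(b)
    best_a, best_q = None, float("-inf")
    for a in range(len(trans)):
        q = sum(reward[a][s] * b[s] for s in range(n))
        pred = [sum(trans[a][s][s2] * b[s] for s in range(n)) for s2 in range(n)]
        for z in range(len(obs[a][0])):
            unnorm = [obs[a][s2][z] * pred[s2] for s2 in range(n)]
            pz = sum(unnorm)                 # P(z | b, a)
            if pz > 0.0:
                q += discount * pz * value(vectors, [p / pz for p in unnorm])
        if q > best_q:
            best_a, best_q = a, q
    return best_a


# Toy model: 2 states, 2 actions, 1 uninformative observation.
identity = [[1.0, 0.0], [0.0, 1.0]]
trans = [identity, identity]
obs = [[[1.0], [1.0]], [[1.0], [1.0]]]
reward = [[1.0, 0.0], [0.0, 1.0]]
vectors = [[0.0, 0.0]]
a_left = improving_action([0.8, 0.2], vectors, trans, obs, reward, 0.95)
a_right = improving_action([0.1, 0.9], vectors, trans, obs, reward, 0.95)
```

The policy is defined for every belief, including beliefs outside the final subset, because the vector set can be evaluated at any belief state.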

6.3.5 Efficiency of SPVI 

The efficiency of SPVI depends on the selected belief subsets. If these belief subsets are
close to the belief space in size, SPVI must be inefficient. Fortunately, our approach for
belief subset expansion ensures that the initial belief subset is small and the subsequent
subsets grow slowly. First, since $\mathcal{H}_0$ is the set of the pairs of information-rich actions and
observations, the initial belief subset $\tau(\mathcal{B},\mathcal{H}_0)$ is relatively small. Second, the subsequent belief
subsets $\tau(\mathcal{B},\mathcal{H}_i)$ do not grow too quickly, for the following reason. In extending a history, the
information-poor pairs are added to its end. Hence, the first action and observation pair of
the histories in a set $\mathcal{H}_i$ must be information-rich. Therefore, for a history $h$ in the set $\mathcal{H}_i$,
$\tau(\mathcal{B},h)$ is small in size. Meanwhile, due to the heuristic for generating history sets, the sizes
$|\mathcal{H}_i|$ would not increase too fast. These characteristics make SPVI efficient when compared
with the standard value iteration algorithm.

Although the above analysis is empirically confirmed in our experiments below, it is
worth mentioning that in the worst case the number of belief simplices grows exponentially
in the number of expansion steps, with base $|A_{IP}||Z_{IP}|$. Since a history determines a belief simplex, in the worst
case the number of the belief simplices after the $i$-step subset expansion is the same as the
number of histories, i.e., $|A_{IR}||Z_{IR}|(|A_{IP}||Z_{IP}|)^i$ (see the third paragraph of Section 6.3.3).
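
To give a feel for this growth, the sketch below evaluates the worst-case count $|A_{IR}||Z_{IR}|(|A_{IP}||Z_{IP}|)^i$ for a few expansion steps. The sizes used are illustrative only and are not taken from the experiments.

```python
def worst_case_histories(n_a_ir, n_z_ir, n_a_ip, n_z_ip, i):
    """|A_IR||Z_IR| (|A_IP||Z_IP|)^i, the worst-case number of histories."""
    return n_a_ir * n_z_ir * (n_a_ip * n_z_ip) ** i


# Illustrative sizes only: 1 information-rich action with 10 observations,
# 4 information-poor actions with 1 observation.
bounds = [worst_case_histories(1, 10, 4, 1, i) for i in range(4)]
```

Under these assumed sizes the bound quadruples with every expansion step, which is the growth the maximal-history heuristic is meant to curb in practice.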

6.4 Empirical Results 

Since SPVI works in an anytime manner, our primary interest is to demonstrate how the
quality of the generated value functions varies with the time cost. However, the availability
of optimal solutions strongly depends on the "tractability" of the problems. If a near-optimal
solution is available, we compare it directly with the value function generated by SPVI
by simulations. Otherwise, we simply compare value functions from SPVI against those
from an approximate algorithm QMDP (Littman et al., 1995; Hauskrecht, 2000). Although
the comparison is not strict in a formal sense, it can provide clues about the quality of value
functions.

We report our results on two variants of the base maze problem and an office navigation 
problem. In one variant, SPVI terminated after a finite number of iterations and the output 
value function is near optimal; in the other variant, SPVI can quickly find a high-quality 
value function as time goes by (Zhang & Zhang, 2001a; Zhang, 2001). In the rest of this 
section, we report our results on the office navigation problem.

The environment is modeled after the floor plan of the authors' home department. The
layout is shown in Figure 6. There are 35 states: 34 locations plus one terminal state.
The action space is of size 6 (four move, one look and one beep replacing declare in the
maze problem). Any action except look leads to a null observation. To introduce other
observations, we note that in the figure, black bars represent doors and grey bars represent
walls with display boards. The look action yields an observation consisting of a string of four letters for
a location indicating, for each of the four directions, whether there is a door (d), an empty
wall (w), a wall with a display board (b), or nothing (o). In total, there are 22 different
strings. Hence, plus the null observation, the observation space is of size 23. Transition
probabilities for moves are specified identically as in the maze problem. Neither look nor
beep changes the states of the environment. At each location, look produces the ideal
string for that location with probability 0.75. With probability 0.05, it produces the null
observation. Also with probability 0.05, it produces a string that is ideal for some other
location and that differs from the ideal string for the current location by only 1 character.
The robot receives a reward of 50 when beeping at location 22 and a reward of -10 when
beeping at any other location (we don't want the robot to make a lot of noise). The move
actions bring about a reward of -2 if they lead the robot to bump into walls or doors.
They have no rewards otherwise. The reward for the look action is always -1. The robot
needs to get to location 22 and beep so that someone in the main office can come out and
hand the robot some mail.

Figure 6: HKUST-CSD office environment

We conduct simulations about the generated value functions because no existing exact 
algorithm can find the near optimal value function. The simulation consists of 1000 trials. 
Each trial starts from a random initial belief state and is allowed to run up to 100 steps. 
The average reward across all the trials is used as a measurement of the quality of policies 
derived by value functions. 
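
The evaluation protocol can be sketched as a simple Monte Carlo loop. The callables `policy`, `sample_initial` and `step` are hypothetical interfaces, and the sketch sums undiscounted rewards per trial, which is an assumption: the text does not say whether rewards are discounted in the simulation. The stub environment in the usage lines is deterministic and serves only to exercise the loop.

```python
import random


def evaluate_policy(policy, sample_initial, step,
                    n_trials=1000, max_steps=100, seed=0):
    """Average total reward over trials; each trial runs at most max_steps steps.

    policy(b) -> action; sample_initial(rng) -> (state, belief);
    step(state, belief, action, rng) -> (next_state, next_belief, reward, done).
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_trials):
        state, belief = sample_initial(rng)
        for _ in range(max_steps):
            action = policy(belief)
            state, belief, reward, done = step(state, belief, action, rng)
            total += reward
            if done:
                break
    return total / n_trials


# Deterministic stub: reward 1 per step, episode ends when state reaches 3.
avg = evaluate_policy(
    policy=lambda b: 0,
    sample_initial=lambda rng: (0, None),
    step=lambda s, b, a, rng: (s + 1, b, 1.0, s + 1 >= 3),
)
```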

Figure 7 presents the results about the quality against time costs. We see that SPVI 
found a policy whose average reward is 19.6 in about 80,000 seconds. SPVI was manually 
terminated after running about 24 hours. It is found that the algorithm conducted three 
steps of subset expansion. By our data, after the first and second expansion steps, both
rewards by simulation are 18.4. This is not far from the 19.6 obtained after the third expansion
step, although it is difficult to say how close those policies are to the optimal. Compared
with the solutions generated by QMDP, the policies generated by SPVI are clearly better.

Figure 7: Performance of SPVI on office navigation problem

For reference, Table 4 gives detailed statistics on the number of histories, iterations
and vectors after each subset expansion. Regarding the number of iterations in the
third column: when conducting value iteration over subsets, we also use the point-based
improvements (Zhang, 2001), and the point-based steps are excluded from the count.
The number of vectors in the fourth column gives some idea of why SPVI takes a long
time on this problem: it generates a great number of vectors. After the
third expansion, it uses 7,225 vectors to represent a value function over the belief subset.

Table 4: Statistics on HKUST-CSD environment for SPVI



7. Related Work 

In this paper, we propose restricted value iteration algorithms to accelerate value iteration 
for POMDPs. Two basic ideas behind restricted value iterations are (1) reducing the com- 
plexity of DP updates and (2) reducing the complexity of value functions. In this section, we 
discuss related work under these two categories. In addition, we give an overview on special 
POMDPs in the literature and the algorithms exploiting their problem characteristics.

7.1 Reducing the Complexity of DP Updates

In a broad sense, approaches reducing the complexity of DP updates can be roughly catego- 
rized into two classes: approaches conducting value updates over a (stationary) belief subset 
and approaches conducting value updates over a growing subset, although the boundary 
between these two classes is ambiguous in some cases. 

The first class includes a family of grid-based algorithms, algorithms based on reachabil- 
ity analysis, algorithms using state-space decomposition and others. Grid-based algorithms 
update values for a finite grid and extrapolate values for non-grid belief states (Lovejoy, 
1991; Hauskrecht, 1997; Zhou & Hansen, 2001). However, to guarantee optimality, the
grid size is often exponential in the dimension of the state space. To tackle POMDPs with 
large state spaces, reachability analysis is a generally applicable technique. If an agent is 
informed of its initial belief, all belief states it can encounter form a finite set in case of a 
finite decision horizon. These belief states can be structured in a decision tree or AND/OR 
tree (Washington, 1997; Hansen & Zilberstein, 1998; Hansen, 1998; Bonet & Geffner, 2000).
Although sometimes near optimality can be achieved at the initial belief state (Hansen, 
1998), the algorithms in the cited articles cannot be applied to the case with unknown 
initial belief. State-space decomposition is an effective way to alleviate the curse of dimen- 
sionality. This approach has been successfully applied to MDPs (Dean & Lin, 1995; Dean, 
Givan, & Kim, 1998; Parr, 1998; Koller & Parr, 2000). Typically to solve an MDP, one
solves a number of small MDPs and uses their solutions to approximate that of the original 
MDP. However, the state-space decomposition approach cannot directly generalize to the 
POMDP context because of the inherent difficulty incurred by the continuum of the belief 
space. 

Our theory and algorithms on restricted value iteration differ significantly from the 
above approaches. With a well-chosen belief subset, restricted value iteration achieves 
convergence and optimality. It differs from grid-based algorithms in that it computes 
vector-based representations of value functions. Despite this difference, grid-based 
algorithms may benefit from our theory on belief subset selection, a possibility that is 
yet to be investigated. For instance, when choosing grid points, one should choose those 
within the belief subset $\tau(B)$: since the belief states outside the set are never 
reachable, their values do not directly contribute to value updates for beliefs in the grid. 
Unlike the algorithms based on reachability analysis, our approach makes no assumption 
about an agent's initial belief, although the belief subset is chosen via reachability 
analysis. Our algorithm differs from the decomposition techniques in that it solves a 
reformulated MDP instead of a set of small MDPs. 

Approaches conducting value updates for a growing belief subset include real-time 
dynamic programming (RTDP) in the POMDP context (Barto, Bradtke, & Singh, 1995; 
Geffner & Bonet, 1998), a synthetic projection algorithm (Drummond & Bresina, 1990) 
and the envelope algorithm for the Plexus planner in the MDP context (Dean et al., 1993). 
Naturally, they run as anytime algorithms. In RTDP, value updates are carried out for 
a belief subset, which grows as an agent explores the belief space. The main differences 
between SPVI and the above algorithms lie in how they expand the belief/state subset and 
how they choose beliefs/states for value updates. For subset expansion, SPVI adds more 
belief simplices, each of which often contains an infinite number of belief states, while 
the above algorithms mostly add a finite number of belief states. (Reachability analysis 
is also used in these expansions.) For value updates, SPVI improves values for the 
entire belief subset, while the above algorithms typically select a limited number of 
beliefs or states in the current subset. 

7.2 Reducing the Complexity of Value Functions 

Another idea behind restricted value iteration is concerned with the representational com- 
plexity of value functions. Intuitively, the representing set of a value function over a belief 
subset contains fewer vectors than that of the same value function over the belief space. 
This fact has been observed (Boutilier & Poole, 1996; Hauskrecht & Fraser, 1998), where 
POMDPs are represented compactly. When states are depicted by a set of variables, they 
are classified into observable variables and hidden variables. It is also noted that some 
belief states cannot be reached for certain combinations of observable variables and hid- 
den variables. This fact has been exploited in approximating the solution to a medical 
treatment example (Hauskrecht & Fraser, 1998). Recent work along this thread includes 
a state-space compression technique exploiting the representational advantage (Poupart & 
Boutilier, 2002), and a technique of Principal Component Analysis (PCA) aiming at reducing 
the complexity of value functions (Roy & Gordon, 2002). However, it is unclear whether 
it is feasible to combine state-space compression with subset value iteration before we know 
how to conduct value iteration over a belief space induced by a compressed state space. 

7.3 Solving Special POMDPs 

Since solving a POMDP generally is computationally intractable, it is advisable to study 
POMDPs with special characteristics. The hope is that these characteristics may be ex- 
ploited to find their near optimal solutions more efficiently. Special POMDPs examined in 
the literature include regional-observable POMDPs (Zhang & Liu, 1997), memory-resetting 
and discernible POMDPs (Hansen, 1998), even-odd POMDPs (Zubek & Dietterich, 2000) 
and generalized near-discernible POMDPs (Zhang, 2001). Interestingly, these POMDPs 
assume the existence of informative actions or observations such that somehow an agent 
is able to get more information about the world. In the following, we briefly discuss the 
assumptions behind informative POMDPs and near-discernible POMDPs and review existing 
work closely related to them. Before concluding this subsection, we also mention a 
couple of extensions to our current work. 

An informative POMDP assumes that any observation restricts the world into a small 
set of states. This assumption is validated by a few problem instances with compact repre- 
sentations of state space. In the literature, some POMDP examples are actually informative 
POMDPs. One example is the slotted Aloha protocol problem (Bertsekas & Gallagher, 1995; 
Cassandra, 1998a), where the state of the system consists of the number of backlogged mes- 
sages and the channel status. The channel status is observable and its possible assignments 
form the observation space. However, the system has no access to the number of backlogged 
messages. If the maximum number of backlogged messages is set to $m$ and there are $n$ 
possible values for the channel status, the number of states is $m \cdot n$. A particular 
assignment of channel status restricts the system to $m$ states out of $m \cdot n$. A 
similar problem characteristic also exists in a non-stationary environment model proposed 
for reinforcement learning (Choi, Yeung, & Zhang, 1999). 

A regional observable POMDP assumes that at any point in time an agent is restricted 
to a handful of world states (Zhang & Liu, 1997). The assumption leads to a value iteration 
algorithm that works with a belief subset and also exploits the low dimensional 
representation of vectors. We used the algorithm to solve informative POMDPs. However, the 
two settings differ in several respects. First, conceptually the assumptions are different 
for the two POMDP classes. In regional observable POMDPs, when the agent is restricted to a 
set of states (i.e., a region), the states in such a set are geometrically neighboring ones. 
In informative POMDPs, however, the set of states to which the agent is restricted is 
obtained by formally analyzing the observation model of the POMDP, and the states in the 
set may be spatially distant from one another. Second, the algorithms for the two POMDP 
classes work in quite different ways. To ease presentation, we use infoVI and roVI to 
denote our value iteration for informative POMDPs and that for regional observable POMDPs, 
respectively. In infoVI, the number of state sets is the product of the number of actions 
and the number of observations, while in roVI, the number of regions is subjectively 
chosen. In addition, the observations in roVI are augmented. An augmented observation 
consists of an original observation and a specific region, so the number of augmented 
observations is the product of the number of original observations and the number of 
regions. Hence, roVI has to account for many more observations than infoVI. This fact is 
useful when comparing the efficiency of infoVI and roVI. Imagine that roVI works, for an 
informative POMDP, with the region system consisting of the state sets defined by infoVI. 
Because infoVI accounts for fewer observations than roVI, it should be more efficient. 
Finally, the quality of the value function returned by infoVI is guaranteed for the entire 
belief space when it terminates with the strict stopping criterion, whereas the quality of 
the value function returned by roVI in its original description is problematic even for 
the considered belief subset. 

The other POMDP class examined in this paper is near-discernible POMDPs. A 
near-discernible POMDP assumes that the actions are classified into information-rich ones 
and information-poor ones. The assumption is reasonable in several realistic domains. The 
first domain is path planning (Cassandra, 1998a), where the actions are categorized into 
goal-achieving and information-gathering ones, as discussed earlier. Another application 
domain is machine maintenance problems (Smallwood & Sondik, 1973; Hansen, 1998), 
where an agent usually can execute the following set of actions: manufacture, examine, 
inspect and replace. Among these actions, "inspect" is information-rich and the remaining 
three actions are information-poor. 

A near-discernible POMDP is a generalization of a memory-resetting (discernible) 
POMDP, which assumes that there exist actions resetting the world to a unique state. If 
such actions are performed, the agent knows that the world must be in a definite state. If 
the initial belief state is known and an optimal policy must execute one of such actions 
periodically, the number of belief states that the agent visits is finite. Accordingly, DP 
updates over a finite set of beliefs are much cheaper. However, after the discernibility 
assumption is relaxed, the agent may visit an infinite number of belief states and DP 
updates become more expensive. We therefore developed an anytime algorithm seeking a 
tradeoff between the solution quality and the size of the belief subset. 

We also experimented with an extension that uses SPVI to approximate the solutions of 
more general POMDPs (Zhang, 2001). The approximation scheme employs a thresholding 
technique. Given a POMDP and a threshold, the POMDP is transformed into a new one that 
differs from the original only in the observation model. The observation model of the 
transformed POMDP is obtained by ignoring the probabilities (in the original model) that 
are less than the threshold (see footnote 7). If the transformed POMDP is near-discernible, 
its solution can be found by SPVI and used to approximate that of the original POMDP. We 
have designed another maze problem that has no informative action/observation pair and is 
therefore not expected to be amenable to SPVI (Zhang, 2001). However, the transformed 
POMDP is amenable to SPVI. The experiments show that SPVI can quickly find a high quality 
solution for the transformed POMDP. Alternatively, if the transformed POMDP is informative, 
the algorithm exploiting low dimensional representations for informative POMDPs can be 
applied. 
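
The thresholding transformation, together with the re-normalization mentioned in footnote 7, can be sketched as follows. This is only an illustration under assumed conventions: the function name `threshold_observation_model`, the nested-list layout `P_obs[a][s][z]` and the numbers are hypothetical, and the actual implementation in (Zhang, 2001) may differ.

```python
def threshold_observation_model(P_obs, threshold):
    """Drop observation probabilities below `threshold` and re-normalize.

    P_obs[a][s][z] = P(z | s, a). For every (action, state) pair, entries
    smaller than the threshold are set to zero and the remaining entries
    are rescaled so that they sum to 1.0 again.
    """
    transformed = []
    for per_action in P_obs:
        rows = []
        for row in per_action:
            kept = [p if p >= threshold else 0.0 for p in row]
            total = sum(kept)
            if total == 0.0:
                # Assumption of this sketch: such a threshold is rejected.
                raise ValueError("threshold removes every observation for a state")
            rows.append([p / total for p in kept])
        transformed.append(rows)
    return transformed


# Made-up model: 1 action, 2 states, 3 observations.
ORIGINAL = [[[0.90, 0.08, 0.02],
             [0.05, 0.05, 0.90]]]
TRANSFORMED = threshold_observation_model(ORIGINAL, threshold=0.1)
```

In this toy model every state is left with a single possible observation after thresholding, which is the kind of structure that makes the transformed POMDP easier to handle than the original one.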

8. Conclusions 

In this paper, we studied value iteration working with belief subsets. We applied 
reachability analysis to select a particular subset. The subset is (1) closed in that no 
action can lead the agent to belief states outside it; (2) sufficient in that a value 
function defined over it can be extended to the entire belief space; and (3) minimal in 
that value iteration needs to consider at least the subset if it intends to achieve the 
quality of value functions. That the subset is 
closed enables one to formulate a subset MDP. We addressed the issues of representing the 
subset and pruning a set of vectors w.r.t. the subset. We then described the subset value 
iteration algorithm. For a given POMDP, whether the subset is proper can be determined 
a priori. If this is the case, subset value iteration carries the advantages of representation 
in space and efficiency in time. We also studied informative POMDPs and near-discernible 
POMDPs. For informative POMDPs, there are natural belief subsets for value iteration 
to work with. For near-discernible POMDPs, we developed an anytime value iteration 
algorithm seeking a tradeoff between the policy quality and the size of belief subsets. 
Appendix A. Proofs 

Theorem 2 For any pair $[a, z]$, the subset $\tau(B, a, z)$ is a simplex. 



7. To complete the definition of the approximate observation model, one needs to re-normalize model 
parameters such that for an action and a state, the probabilities for all observations sum up to 1.0. 



Proof: Suppose $b_i$ is a belief state such that $b_i(s) = 1.0$ for $s = s_i$ and $0$ otherwise. 
It can be seen that $\{b_1, b_2, \ldots, b_n\}$ is a basis of belief space $B$. Each belief state 
$b$ ($= (b(s_1), b(s_2), \ldots, b(s_n))$) can be represented as $\sum_{i=1}^{n} b(s_i) b_i$. 

Let $k$ be the cardinality of the set $\{\tau(b_i, a, z) \mid P(z|b_i, a) > 0\}$. Without loss of 
generality, we enumerate the set as $\{\tau(b_1, a, z), \ldots, \tau(b_k, a, z)\}$. It suffices to 
show that $\tau(B, a, z) = \triangle(\tau(b_1, a, z), \tau(b_2, a, z), \ldots, \tau(b_k, a, z))$. 
To prove it, we prove: 

(1) $\tau(B, a, z) \subseteq \triangle(\tau(b_1, a, z), \tau(b_2, a, z), \ldots, \tau(b_k, a, z))$ and 

(2) $\triangle(\tau(b_1, a, z), \tau(b_2, a, z), \ldots, \tau(b_k, a, z)) \subseteq \tau(B, a, z)$. 

First, we prove (1). It suffices to show that any belief state $b'$ in $\tau(B, a, z)$ must belong 
to the simplex $\triangle$. Since $b'$ is in $\tau(B, a, z)$, there must exist a belief state $b$ in 
$B$ such that $b' = \tau(b, a, z)$. We define a few constants as follows. 

• For any $i \in \{1, \ldots, k\}$, $C_{b_i}$ is the probability of observing $z$ when action $a$ is 
executed in belief state $b_i$. Formally, $C_{b_i} = \sum_{s', s} P(z|s', a) P(s'|s, a) b_i(s)$. 

• $C_b$ is the probability of observing $z$ when action $a$ is performed in $b$. Formally, 
$C_b = \sum_{s', s} P(z|s', a) P(s'|s, a) b(s)$. 

• For any $i \in \{1, \ldots, k\}$, define $\lambda_i = b(s_i) C_{b_i} / C_b$. 

These weights are nonnegative and sum to one, because $C_b = \sum_{i} b(s_i) C_{b_i}$ and the 
terms with $i > k$ vanish. Given these constants, we are going to prove 
$b' = \sum_i \lambda_i \tau(b_i, a, z)$. If this is true, i.e., $b'$ can be represented as a convex 
combination of the vectors in the basis, (1) is proven. 

We start from $b' = \tau(b, a, z)$. If $\tau(b, a, z)$ is replaced by its definition, for a state $s'$, 

$$b'(s') = \frac{1}{C_b} \sum_{s} P(z|s', a) P(s'|s, a) b(s).$$ 

By the definition of belief state $b_i$, we have $b(s) = \sum_i b(s_i) b_i(s)$, and the states $s_i$ 
with $i > k$ contribute nothing to the sum. We can therefore rewrite the above equation as 

$$b'(s') = \frac{1}{C_b} \sum_{i=1}^{k} \sum_{s} P(z|s', a) P(s'|s, a) \, b(s_i) \, b_i(s).$$ 

Trivially, 

$$b'(s') = \sum_{i=1}^{k} \frac{b(s_i) C_{b_i}}{C_b} \cdot \frac{\sum_{s} P(z|s', a) P(s'|s, a) b_i(s)}{C_{b_i}}.$$ 

By the definition of $\tau(b_i, a, z)$, rewriting the above equation, we have 

$$b'(s') = \sum_{i=1}^{k} \frac{b(s_i) C_{b_i}}{C_b} \, \tau(b_i, a, z)(s').$$ 

By the definition of $\lambda_i$, the above equation yields 

$$b'(s') = \sum_{i=1}^{k} \lambda_i \, \tau(b_i, a, z)(s').$$ 

If $b'$ and $\tau(b_i, a, z)$ are regarded as column vectors, the above equation means 

$$b' = \sum_{i=1}^{k} \lambda_i \, \tau(b_i, a, z).$$ 

Therefore, we prove that if there is a belief state $b$ such that $b' = \tau(b, a, z)$, $b'$ can be 
represented as a convex combination of the vectors in the basis. This means $b'$ must be in 
the simplex $\triangle$. 

To prove (2), we prove that any belief state $b'$ in the simplex $\triangle$ must be in the subset 
$\tau(B, a, z)$. It suffices to show that there exists a belief state $b$ in $B$ such that 
$b' = \tau(b, a, z)$. 

Since $b'$ is in $\triangle$, there must exist a set of nonnegative $\lambda_i$s summing to one such 
that $b' = \sum_{i=1}^{k} \lambda_i \tau(b_i, a, z)$. If we replace $\tau(b_i, a, z)$ by its 
definition, then: for a state $s'$, 

$$b'(s') = \sum_{i=1}^{k} \lambda_i \, \frac{\sum_{s} P(z|s', a) P(s'|s, a) b_i(s)}{\sum_{s', s} P(z|s', a) P(s'|s, a) b_i(s)}.$$ 

If we denote $\sum_{s', s} P(z|s', a) P(s'|s, a) b_i(s)$ by a constant $C_{b_i}$, then 

$$b'(s') = \sum_{i=1}^{k} \sum_{s} \frac{\lambda_i}{C_{b_i}} P(z|s', a) P(s'|s, a) b_i(s).$$ 

Exchanging the summation order of $i$ and $s$ and making use of the definition of $b_i(s)$, we 
have 

$$b'(s') = \sum_{i=1}^{k} \frac{\lambda_i}{C_{b_i}} P(z|s', a) P(s'|s_i, a).$$ 

We define a belief state $b$ as follows: for any $s_i$ with $i \le k$, 

$$b(s_i) = \frac{\lambda_i / C_{b_i}}{\sum_{j=1}^{k} \lambda_j / C_{b_j}},$$ 

and $b(s_i) = 0$ for $i > k$. It can be seen that 

$$b'(s') = \frac{\sum_{s} P(z|s', a) P(s'|s, a) b(s)}{\sum_{s', s} P(z|s', a) P(s'|s, a) b(s)}.$$ 

Therefore, we proved that for any $b'$ in $\triangle$ there exists a belief state $b$ such that 
$b' = \tau(b, a, z)$. 

Consequently, $b' \in \tau(B, a, z)$. □ 
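
Part (1) of the proof can be checked numerically on a small model. The sketch below is illustrative only: the helper names and the model parameters are hypothetical, and the weights are computed as $\lambda_i = b(s_i) C_{b_i} / C_b$, where $C_{b_i}$ and $C_b$ are the observation probabilities defined in the proof. It verifies that the weights sum to one and that the weighted combination of the vertices $\tau(b_i, a, z)$ coincides with $\tau(b, a, z)$.

```python
def obs_prob_and_update(b, a, z, P_trans, P_obs):
    """Return (C, tau) where C = P(z | b, a) and tau = tau(b, a, z) (None if C == 0).

    P_trans[a][s][s2] = P(s2 | s, a); P_obs[a][s2][z] = P(z | s2, a).
    """
    n = len(b)
    unnorm = [P_obs[a][s2][z] * sum(P_trans[a][s][s2] * b[s] for s in range(n))
              for s2 in range(n)]
    c = sum(unnorm)
    return c, ([x / c for x in unnorm] if c > 0 else None)


def convex_decomposition(b, a, z, P_trans, P_obs):
    """Weights lambda_i, their mixture of the vertices, and tau(b, a, z) itself."""
    n = len(b)
    c_b, tau_b = obs_prob_and_update(b, a, z, P_trans, P_obs)
    lambdas, vertices = [], []
    for i in range(n):
        unit = [1.0 if s == i else 0.0 for s in range(n)]   # unit belief b_i
        c_i, tau_i = obs_prob_and_update(unit, a, z, P_trans, P_obs)
        if c_i > 0:                                          # only the k usable vertices
            lambdas.append(b[i] * c_i / c_b)
            vertices.append(tau_i)
    mixture = [sum(l * v[s2] for l, v in zip(lambdas, vertices)) for s2 in range(n)]
    return lambdas, mixture, tau_b


# Made-up model: 3 states, 1 action, 2 observations; state 2 can never yield z = 0.
P_TRANS = [[[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.0, 0.0, 1.0]]]
P_OBS = [[[0.9, 0.1], [0.5, 0.5], [0.0, 1.0]]]
LAMBDAS, MIXTURE, TAU_B = convex_decomposition([0.2, 0.5, 0.3], 0, 0, P_TRANS, P_OBS)
```

Here only two of the three unit beliefs yield a vertex, so the simplex has $k = 2$ vertices, matching the case $k < n$ allowed by the proof.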

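Theorem 6 If $\max_{b \in \tau(B)} |V_n^{\tau(B)}(b) - V_{n-1}^{\tau(B)}(b)| \le \epsilon(1 - \lambda)/(2\lambda^2 |Z|)$, 
then the $V_n^{\tau(B)}$-improving policy is $\epsilon$-optimal over the entire belief space $B$. 

Proof: It suffices to show $\max_{b \in B} |V_{n+1}(b) - V_n(b)| \le \epsilon(1 - \lambda)/(2\lambda)$. 
For any $b \in B$, 

$$
\begin{aligned}
& |V_{n+1}(b) - V_n(b)| \\
&= \Big| \max_a \Big\{ r(b, a) + \lambda \sum_z V_n^{\tau(B,a,z)}(\tau(b, a, z)) \Big\} - \max_a \Big\{ r(b, a) + \lambda \sum_z V_{n-1}^{\tau(B,a,z)}(\tau(b, a, z)) \Big\} \Big| && (1) \\
&\le \Big| r(b, a^*) + \lambda \sum_z V_n^{\tau(B,a^*,z)}(\tau(b, a^*, z)) - r(b, a^*) - \lambda \sum_z V_{n-1}^{\tau(B,a^*,z)}(\tau(b, a^*, z)) \Big| && (2) \\
&\le \Big| \lambda \sum_z \Big( V_n^{\tau(B,a^*,z)}(\tau(b, a^*, z)) - V_{n-1}^{\tau(B,a^*,z)}(\tau(b, a^*, z)) \Big) \Big| && (3) \\
&\le \lambda |Z| \max_z \Big| V_n^{\tau(B,a^*,z)}(\tau(b, a^*, z)) - V_{n-1}^{\tau(B,a^*,z)}(\tau(b, a^*, z)) \Big| && (4) \\
&\le \lambda |Z| \, \epsilon(1 - \lambda)/(2\lambda^2 |Z|) && (5) \\
&\le \epsilon(1 - \lambda)/(2\lambda) && (6)
\end{aligned}
$$

where 

• At Step (1), value functions are replaced by their definitions; 

• At Step (2), $a^*$ is the $V_n$-improving action for $b$ but it is not necessarily 
$V_{n-1}$-improving; 

• At Step (5), the given condition is used; 

• Other steps are trivial. □ 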
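
To give a feeling for the magnitudes involved in Theorem 6, the following sketch evaluates the two bounds that appear in the statement and the proof. The function names are hypothetical, and the discount factor 0.95 is an assumed value chosen only for illustration; the precision 0.1 and the 32 observations are those of the elevator problem in Appendix B.

```python
def residual_bound(epsilon, discount):
    """Bound epsilon(1 - lambda) / (2 lambda) on the residual over the belief space."""
    return epsilon * (1.0 - discount) / (2.0 * discount)


def strict_threshold(epsilon, discount, n_observations):
    """Bound epsilon(1 - lambda) / (2 lambda^2 |Z|) on the residual over the subset."""
    return epsilon * (1.0 - discount) / (2.0 * discount ** 2 * n_observations)


# epsilon and |Z| follow Appendix B; the discount factor is an assumption.
EPSILON, DISCOUNT, N_OBS = 0.1, 0.95, 32
STRICT = strict_threshold(EPSILON, DISCOUNT, N_OBS)
RESIDUAL = residual_bound(EPSILON, DISCOUNT)
```

The strict threshold is smaller than the residual bound by a factor of $\lambda |Z|$, so it shrinks quickly as the number of observations grows.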

Appendix B. Informative POMDPs: An Elevator Problem 

This appendix describes a 96-state informative POMDP and empirical results of value 
iteration over $\phi(B)$. The problem is adapted from existing research (Choi et al., 1999). 
Our purpose is to show that restricted value iteration is able to solve larger POMDPs than 
standard value iteration. 

Problem Formulation 

An elevator operates in a two-floor residential building. There are three patterns of 
passenger arrival: a high arrival rate on the first floor and a low one on the second 
floor; a low arrival rate on the first floor and a high one on the second floor; and equal 
arrival rates. As time passes from morning to night, these patterns change according to a 
probability distribution. To keep track of the pick-up and drop-off requests, the elevator 
has four buttons on its control panel: two buttons record the pick-up and drop-off requests 
for the first floor, and the other two keep the same information for the second floor. The 
elevator is also aware of which floor it is on. To fulfill the requests at a floor, the 
elevator first moves upwards or downwards until it reaches the floor; it then stays at 
the floor until the passengers finish entering or exiting. The objective of the elevator is 
to minimize a certain penalty or cost in the long run. 

The problem can be formulated as a POMDP. A state consists of six components: the 
arrival pattern, the pick-up requests for the two floors, the drop-off requests for the two 
floors and the elevator's position. We use six variables to denote the components 
respectively; a state is an assignment to all the variables. The arrival pattern takes on 
three possible values, one for each pattern. If there are passengers waiting in the lobby 
of the first floor, the pick-up request for the first floor is set; otherwise, it is unset. 
If there are passengers in the elevator intending to get off at the first floor, the 
drop-off request for the first floor is set; otherwise, it is unset. The variables for the 
pick-up/drop-off requests of the second floor are set similarly. If the elevator is at the 
first floor, its position is set to first; if it is at the second floor, its position is set 
to second. The number of states is 3*2*2*2*2*2 = 96. Each observation has five components: 
the same components as a state except the arrival pattern. There are thus 2*2*2*2*2 = 32 
observations. The elevator may execute one of three actions, namely go_up, go_down and 
stay, with one restriction: when it is at the first floor, it cannot perform go_down; when 
it is at the second floor, it cannot perform go_up. 
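
The state and observation spaces described above can be enumerated directly. The following sketch is only an illustration: the tuple layout, the pattern labels and the identifier spelling of the action names are assumptions made here, not part of the original formulation.

```python
from itertools import product

# Hypothetical labels for the three arrival patterns.
PATTERNS = ("high_first", "high_second", "equal")
FLAGS = (False, True)            # request unset / set
POSITIONS = ("first", "second")

# state = (pattern, pickup1, pickup2, dropoff1, dropoff2, position)
STATES = list(product(PATTERNS, FLAGS, FLAGS, FLAGS, FLAGS, POSITIONS))

# An observation is a state without its arrival pattern.
OBSERVATIONS = sorted({state[1:] for state in STATES})


def legal_actions(state):
    """Actions allowed in `state`: no go_down at the first floor, no go_up at the second."""
    position = state[-1]
    return ("go_up", "stay") if position == "first" else ("go_down", "stay")
```

With this layout, `len(STATES)` is 96 and `len(OBSERVATIONS)` is 32, in agreement with the counts given in the text.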
The uncertainty stems from the probabilistic changes of the arrival pattern. When the 
elevator executes go_up, each component evolves as follows. The arrival pattern changes 
according to a predetermined probability distribution. The components for pick-up/drop-off 
requests remain unchanged. The position changes from first to second. The effects of the 
action go_down can be described similarly. When the elevator performs the action stay, 
the arrival pattern changes in the same way, and all the requests at the current floor are 
fulfilled and the corresponding variables are reset. For instance, if a passenger would 
like to get off at the first floor, the passenger is able to do so when the elevator at the 
first floor performs stay; we say that the elevator fulfills the drop-off request for the 
first floor. Likewise, passengers who would like to enter the elevator at the second floor 
can do so only when the elevator performs stay at the second floor; we say that the pick-up 
request at the second floor is fulfilled in this case. The elevator may also fulfill two 
requests at one time point. For example, if there are both pick-up and drop-off requests at 
the first floor, then when the elevator performs stay, the passengers can enter and exit 
within one time point. Note that the elevator can fulfill requests only when it performs 
stay. Since the arrival pattern may change at every time point, the elevator changes its 
state probabilistically after performing any action. 
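
The effects of the three actions on the observable components can be summarized in a short sketch. The function `next_observable` and the tuple layout are hypothetical; the code covers only the effects stated above and leaves the stochastic change of the arrival pattern to be handled separately.

```python
def next_observable(state, action):
    """Observable components after `action`, following the effects described in the text.

    state = (pattern, pickup1, pickup2, dropoff1, dropoff2, position).
    The arrival pattern evolves stochastically and is not handled here.
    """
    _, pickup1, pickup2, dropoff1, dropoff2, position = state
    if action == "go_up":
        if position != "first":
            raise ValueError("go_up is only allowed at the first floor")
        position = "second"               # requests remain unchanged
    elif action == "go_down":
        if position != "second":
            raise ValueError("go_down is only allowed at the second floor")
        position = "first"                # requests remain unchanged
    elif action == "stay":
        if position == "first":
            pickup1 = dropoff1 = False    # both requests at the first floor are fulfilled
        else:
            pickup2 = dropoff2 = False    # both requests at the second floor are fulfilled
    else:
        raise ValueError(f"unknown action: {action}")
    return (pickup1, pickup2, dropoff1, dropoff2, position)
```

For example, performing stay at the first floor with both first-floor requests set clears both of them in a single time point, while the requests of the second floor are untouched.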

The elevator has only partial knowledge of its state. After performing an action, it 
knows the values of the observable components of its state: the pick-up and drop-off 
variables for each floor and its position. However, since the arrival pattern is a 
component of the state and the elevator does not observe it, the observations cannot reveal 
the identity of the state. This is the partial observability. Nevertheless, since there are 
only three possible arrival patterns, each observation reveals that the elevator must be in 
one of only three possible states. Therefore, the POMDP is informative. 
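
The informativeness of the problem can be verified by grouping the states by the observation they produce. The sketch below uses a hypothetical tuple layout and pattern labels; it is meant only to illustrate that every observation is consistent with exactly three states.

```python
from collections import defaultdict
from itertools import product

PATTERNS = ("A1", "A2", "A3")    # labels for the three arrival patterns
FLAGS = (False, True)
POSITIONS = ("first", "second")

# state = (pattern, pickup1, pickup2, dropoff1, dropoff2, position)
STATES = list(product(PATTERNS, FLAGS, FLAGS, FLAGS, FLAGS, POSITIONS))


def group_states_by_observation(states):
    """Map each observation (a state without its pattern) to the states consistent with it."""
    groups = defaultdict(list)
    for state in states:
        groups[state[1:]].append(state)
    return dict(groups)


GROUPS = group_states_by_observation(STATES)
```

Each of the 32 groups contains three states that differ only in the arrival pattern, so a belief state following any observation has at most three nonzero components out of 96.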

The performance of the elevator can be measured in different ways for diverse 
applications. We define a measure that captures how unsatisfactory the service provided by 
the elevator is, and we encode it in our reward model. At any time point, there are four 
kinds of requests the elevator may need to serve: pick-up requests at the first/second 
floor and drop-off requests at the first/second floor. After performing an action, the 
elevator receives a penalty of 0.25 for each of these four requests that is unfulfilled. 
For instance, if the elevator leaves both the pick-up and the drop-off request at the first 
floor unfulfilled (when both are set), it receives a penalty of 0.25 * 2 = 0.5. 
The objective of the elevator is to minimize the total discounted penalty in the long run. 
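
The penalty model can be written down in a few lines. This is a sketch under the reading that each request still set after an action costs 0.25; the function names are hypothetical and the discount factor used in the usage example is an arbitrary illustrative value.

```python
def penalty(pickup1, pickup2, dropoff1, dropoff2, unit=0.25):
    """Penalty after an action: `unit` for every request that is still set (unfulfilled)."""
    return unit * sum((pickup1, pickup2, dropoff1, dropoff2))


def total_discounted_penalty(penalties, discount):
    """Discounted sum of a finite sequence of per-step penalties."""
    return sum(discount ** t * p for t, p in enumerate(penalties))


# Both first-floor requests remain unfulfilled: 0.25 * 2 = 0.5.
EXAMPLE = penalty(pickup1=True, pickup2=False, dropoff1=True, dropoff2=False)
```

The per-step penalty ranges from 0.0 (no pending request) to 1.0 (all four requests pending).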

For convenience, we use $A_i$ to denote arrival patterns for $i = 1, 2, 3$. In our experiments, 
the transition probabilities are set as in the following table. Basically, each pattern remains 
fixed with probability 0.90 and changes to another with 0.05. 
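
As a sketch of the transition model just described, the matrix below places 0.90 on the diagonal and 0.05 elsewhere. The function names are hypothetical, and the matrix is derived only from the sentence above rather than from the original table.

```python
def pattern_transition_matrix(n_patterns=3, p_stay=0.90, p_switch=0.05):
    """Row i gives P(next pattern = A_j | current pattern = A_i)."""
    return [[p_stay if i == j else p_switch for j in range(n_patterns)]
            for i in range(n_patterns)]


def propagate(dist, matrix):
    """One-step update of a distribution over arrival patterns."""
    n = len(matrix)
    return [sum(dist[i] * matrix[i][j] for i in range(n)) for j in range(n)]


MATRIX = pattern_transition_matrix()
AFTER_ONE_STEP = propagate([1.0, 0.0, 0.0], MATRIX)   # start in pattern A_1
```

Each row sums to one, and starting from pattern $A_1$ the distribution after one step is 0.90 for $A_1$ and 0.05 for each of the other two patterns.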
Figure 8: Performance of VI, ssVI, infoVI and infoVIPB on Elevator 

Empirical Studies 

We collect the time costs and the actual number of vectors generated at each iteration of 
the algorithms VI, ssVI, infoVI and infoVIPB, where infoVIPB refers to infoVI integrated 
with a point-based procedure (Zhang, 2001). The results are presented in Figure 8. 

The first chart in the figure shows the time costs against the iterations. The algorithms 
are set to compute a 0.1-optimal value function. For infoVIPB, we exclude the iterations 
for point-based improvements. Overall, we see that VI and ssVI cannot solve the problem 
within a reasonable time, that infoVI is likely to solve it given sufficient time, and that 
infoVIPB is able to solve it easily. When infoVIPB runs, it uses the loose stopping 
criterion, because the threshold of the strict one is close to the round-off precision 
parameter. 

For the first seven iterations, ssVI takes 190,000 seconds, infoVI only 32 seconds. The 
performance difference is drastic. As infoVI proceeds, it takes about 1,100 seconds for 
one iteration. It is evident that infoVI is able to compute a near optimal value function 
if given sufficient time. When the point-based technique is integrated, infoVIPB is able 
to terminate in 94 seconds after five steps of DP updates over $\phi(B)$. Since most of these 
algorithms cannot terminate within a reasonable time limit, we compare the data on the 
6th iteration among them. This is the last iteration we are able to gather statistics for VI. 
For this iteration, VI takes 76,000 seconds, ssVI 6,000 seconds, infoVIPB only 8 seconds. 

The second chart in Figure 8 depicts the number of vectors generated at each iteration for 
the tested algorithms. For ssVI, we collect the sum of the numbers of vectors representing 
value functions over $|A| \cdot |Z|$ $\tau$-simplices. For infoVI and infoVIPB, we collect the sum of 
the numbers of representing vectors for $|Z|$ $\phi$-simplices. This is because for this problem 
the observation models are independent of the actions. 

From the chart, we see that VI generates significantly more vectors than ssVI and 
infoVI. In our experiments, after infoVIPB terminates, it produces 1,132 vectors. For the 
same reason as above, we compare the numbers of vectors after the 6th iteration for 
these algorithms. After this iteration, VI generates 12,000 vectors. For ssVI and infoVI, 
this number is 252 and 136 respectively. As DP updates proceed, it is conceivable that 
the number of vectors generated by VI will increase sharply and hence the DP updates are 
extremely inefficient. For infoVIPB, since the final number of generated vectors is rather 
small, and since point-based improvement effectively reduces the number of iterations over 
$\phi(B)$, it is possible to compute a near optimal value function within a rather small 
time limit, as it turns out. 